What Is a Data Pipeline? Building a Data Pipeline with the Data Block API for Supervised Machine Learning Training - Preprocessing ep.5

When training a Machine Learning model, especially with Supervised Learning (that is, on labeled data), training itself and designing the model architecture are not the only work. Several other important tasks must be done before training can start. One of them is building a Data Pipeline to prepare the data.

What Is a Data Pipeline?

A Data Pipeline prepares data from its source, whether image files, text files, audio files, video files, or tabular data, and puts it into a form that a Machine Learning model can consume. It consists of the following steps:

List All Examples / Get Files – list every item in the Dataset (file names)
- tfms – attach the Transforms that are needed
Split to Training Set, Validation Set – split the data into a Training Set and a Validation Set
- by Random %, Folder name, CSV, … – using one of several methods, such as a random percentage, by folder, or as specified in a CSV file, …
Label – attach a Label to each example, for Supervised Learning
- Folder name, File name, CSV, … – from the folder name, the file name, as specified in a CSV, etc. The Labels of the Validation Set depend on the Training Set (for example, the mapping from class names to ids is built on the Training Set and reused)
Transform (Optional) – transform the data. The Transforms of the Validation Set depend on the Training Set
- per Example/Image – applied to one example at a time, e.g. converting image channels, resizing images, etc.
- per Training Set – Normalize, Fill N/A with Median, Categorize, Tokenize, Numericalize, etc.
To Tensor – convert to Tensors, because PyTorch takes Tensors as input
DataLoader to Batch – we usually cannot load the entire Dataset into memory at once, so we use a DataLoader to shuffle the data and split it into Batches (Lazy Loading)
Transform per Batch – transform the data one Batch at a time
DataBunch – create a DataBunch that wraps the Training Set and Validation Set
Add Test Set (Optional) – add a Test Set (if there is one)
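
The idea that the Validation Set depends on the Training Set can be sketched in plain Python. This is only a minimal illustration under assumptions, not the Data Block API itself: the helper names (`split_random`, `build_vocab`, `fit_normalizer`, `batches`) and the toy dataset are hypothetical, and real pipelines would work on files and Tensors rather than floats.

```python
import random
from statistics import mean, pstdev

def split_random(items, valid_pct=0.2, seed=42):
    """Shuffle a copy of items and split it into (train, valid)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_valid = int(len(items) * valid_pct)
    return items[n_valid:], items[:n_valid]

def build_vocab(labels):
    """Map each distinct label to an integer id (sorted for reproducibility)."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}

def fit_normalizer(values):
    """Compute mean/std on the given values and return a function applying them."""
    m, s = mean(values), pstdev(values)
    s = s or 1.0  # avoid dividing by zero when all values are equal
    return lambda x: (x - m) / s

def batches(examples, batch_size, shuffle=False, seed=0):
    """Lazily yield lists of at most batch_size examples."""
    order = list(range(len(examples)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [examples[i] for i in order[start:start + batch_size]]

# Hypothetical toy dataset of (feature, label) pairs.
raw = [(float(i), "cat" if i % 2 == 0 else "dog") for i in range(10)]

# 1. Split first, so nothing from the Validation Set leaks into fitted state.
train_raw, valid_raw = split_random(raw, valid_pct=0.2)

# 2. Fit label ids and normalization statistics on the Training Set only.
vocab = build_vocab(label for _, label in train_raw)
normalize = fit_normalizer([x for x, _ in train_raw])

# 3. Apply the same fitted state to both sets.
train = [(normalize(x), vocab[y]) for x, y in train_raw]
valid = [(normalize(x), vocab[y]) for x, y in valid_raw]

# 4. Shuffle and batch the Training Set.
train_batches = list(batches(train, batch_size=4, shuffle=True))
```

The statistics and the label mapping are computed once on the Training Set and then reused for the Validation Set, so both sets are encoded consistently.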

A common metal short-link chain. Credit: https://commons.wikimedia.org/wiki/File:Broad_chain_closeup.jpg

We will build the API in a Method Chaining style, so that methods can be chained one after another with dot notation, as flexibly as needed.

data = #Where to find the data? -> in path and its subfolders
       (ImageList.from_folder(path)     
        #How to split in train/valid? -> use the folders
        .split_by_folder()              
        #How to label? -> depending on the folder of the filenames
        .label_from_folder()            
        #Optionally add a test set (here default name is test)
        .add_test_folder()              
        #Data augmentation? -> use tfms with a size of 64
        .transform(tfms, size=64)       
        #Finally? -> use the defaults for conversion to ImageDataBunch
        .databunch())
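
To make the Method Chaining idea concrete, here is a rough sketch of how such a chain can work: each step returns a new object that exposes the next step. The classes `TinyItemList`, `TinySplit` and `TinyLabeled` are hypothetical stand-ins invented for this illustration, not fastai's classes, and the `transform` signature here is simplified to a single function.

```python
from pathlib import PurePosixPath

class TinyItemList:
    """Hypothetical stand-in: holds file paths, first link in the chain."""
    def __init__(self, paths):
        self.paths = [PurePosixPath(p) for p in paths]

    def split_by_folder(self, train="train", valid="valid"):
        # A path belongs to a set if one of its folders has that set's name.
        return TinySplit(
            [p for p in self.paths if train in p.parts],
            [p for p in self.paths if valid in p.parts],
        )

class TinySplit:
    """Hypothetical stand-in: paths already split into train and valid."""
    def __init__(self, train, valid):
        self.train, self.valid = train, valid

    def label_from_folder(self):
        # The label is the name of the folder that directly contains the file.
        def label(p):
            return p.parent.name
        return TinyLabeled(
            [(p, label(p)) for p in self.train],
            [(p, label(p)) for p in self.valid],
        )

class TinyLabeled:
    """Hypothetical stand-in: (item, label) pairs for train and valid."""
    def __init__(self, train, valid):
        self.train, self.valid = train, valid

    def transform(self, fn):
        # Apply fn to every item and return a new object, so chaining continues.
        return TinyLabeled(
            [(fn(x), y) for x, y in self.train],
            [(fn(x), y) for x, y in self.valid],
        )

# Made-up file names, for illustration only.
paths = ["data/train/cat/1.jpg", "data/train/dog/2.jpg", "data/valid/cat/3.jpg"]

data = (TinyItemList(paths)
        .split_by_folder()
        .label_from_folder()
        .transform(lambda p: p.name))
```

Because each method returns an object of the next stage, the order of the steps is enforced by the types themselves: for example, `label_from_folder` is only available after a split.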

Let's get started.

Check it out on github Last updated: 01/06/2026 10:47:00


Surapong Kanoktipsatharporn

CTO at Bua Labs

The ultimate test of your knowledge is your capacity to convey it to another.
