Handling Tiny, Fragmented Categories: Merging Small Categories into an Other Category Before Training a Machine Learning Model - Preprocessing ep.4

In many datasets, categorical data is split far too finely. Some categories have only 1 or 2 records, or the small categories have hundreds or thousands of times fewer records than the large ones. These tiny, fragmented categories may not help a Machine Learning model learn anything.

The fix is to group those small categories together into a single new category named Other.
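The grouping step can be sketched in plain Python as follows. This is a minimal illustration, not the article's own notebook code: the helper name `group_rare_categories`, the `min_count` threshold, and the sample color values are all assumptions made for the example.

```python
from collections import Counter


def group_rare_categories(values, min_count=10, other_label="Other"):
    """Replace every category seen fewer than min_count times with other_label.

    min_count is a hypothetical cutoff; pick it by inspecting your own data.
    """
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Toy data: two frequent categories and two that appear only once.
colors = ["red"] * 5 + ["blue"] * 4 + ["teal", "plum"]
grouped = group_rare_categories(colors, min_count=3)

print(Counter(grouped))  # Counter({'red': 5, 'blue': 4, 'Other': 2})
```

With a pandas Series, the same idea is usually written with `value_counts()` to find the rare labels and `isin()` to replace them.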

An example of a power law graph showing popularity ranking. To the right (yellow) is the long tail; to the left (green) are the few that dominate. In this example, the cutoff is chosen so that areas of both regions are equal. Credit https://commons.wikimedia.org/wiki/File:Long_tail.svg

Other (Everything Else)

Creating an Other category has another benefit: if a new, previously unseen category slips in when the model is used in production, we can simply put it into Other without changing much code.
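One way to get this behavior is to remember which categories were kept at training time and send everything else to Other when new data arrives. The sketch below assumes that approach. The function names `fit_known_categories` and `apply_other`, the threshold, and the animal labels are hypothetical.

```python
from collections import Counter


def fit_known_categories(train_values, min_count=10):
    """Return the set of categories frequent enough to keep (learned from training data)."""
    counts = Counter(train_values)
    return {v for v, n in counts.items() if n >= min_count}


def apply_other(values, known, other_label="Other"):
    """Map any value not in the known set, rare or never seen before, to other_label."""
    return [v if v in known else other_label for v in values]


train = ["cat"] * 4 + ["dog"] * 3 + ["emu"]
known = fit_known_categories(train, min_count=2)

# "yak" never appeared in training; "emu" was too rare to keep.
incoming = ["cat", "yak", "dog", "emu"]
print(apply_other(incoming, known))  # ['cat', 'Other', 'dog', 'Other']
```

Because the known set is fixed at training time, the encoding used at prediction time stays the same no matter which labels show up later.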

This matters even more when the category is One-Hot Encoded. With a large number of fragmented categories, say in the thousands, we would have to add thousands of columns, one per category, and such a wide, sparse input can cause problems for the model.
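To make the column count concrete, here is a small sketch that one-hot encodes a toy feature before and after grouping. The `one_hot` helper is a hand-written illustration in plain Python rather than a library API, and the data (3 frequent categories plus 1,000 categories that appear once each) and the cutoff of 10 are made up for the example.

```python
from collections import Counter


def one_hot(values):
    """Return (column names, rows of 0/1) with one column per distinct category."""
    columns = sorted(set(values))
    index = {c: i for i, c in enumerate(columns)}
    rows = []
    for v in values:
        row = [0] * len(columns)
        row[index[v]] = 1
        rows.append(row)
    return columns, rows


# 3 frequent categories plus 1,000 hypothetical categories with one record each.
values = ["a"] * 500 + ["b"] * 300 + ["c"] * 200 + [f"rare_{i}" for i in range(1000)]

counts = Counter(values)
grouped = [v if counts[v] >= 10 else "Other" for v in values]

cols_before, _ = one_hot(values)
cols_after, rows_after = one_hot(grouped)

print(len(cols_before), len(cols_after))  # 1003 4
```

In this toy case, grouping shrinks the encoded feature from 1,003 columns to 4 while keeping every record.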

Let's get started.

Check it out on GitHub. Last updated: 28/02/2024 04:27:02

Surapong Kanoktipsatharporn

Solutions Architect at Bua Labs

The ultimate test of your knowledge is your capacity to convey it to another.
