When we get a new dataset, the first thing to do is Exploratory Data Analysis (EDA): understanding the data in each feature. What type is the data? Is it continuous or discrete? How wide is its range? How is it distributed? How much of it is missing? How are the features related to one another?

All of this analysis is fairly involved, and it is repeated in much the same way for every dataset. Is there a way to make this repetitive work easier?

Pandas DataFrame.describe()

When analyzing tabular data we normally use a Pandas DataFrame, and one of the first functions we reach for to get an overview of the table is dataframe.describe().

The drawback of describe is that its output is a single simple table with too little information, so we have to write extra code ourselves to compare and relate the things we want to know. That is repetitive work, done in much the same way for every dataset.
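As a rough illustration of that extra work, the sketch below runs describe() on a small made-up DataFrame and then computes the missing-value counts and correlations by hand. The column names and values are hypothetical and are not taken from the Adult data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "hours": [40, 45, np.nan, 60, 40],
    "sex": ["Male", "Female", "Female", "Male", "Male"],
})

# describe() summarizes only the numeric columns by default.
summary = toy.describe()

# Things describe() does not report directly have to be computed separately.
missing = toy.isna().sum()            # missing values per column
corr = toy[["age", "hours"]].corr()   # Pearson correlation between numeric columns

print(summary)
print(missing)
print(corr)
```

Note that the categorical column sex does not appear in the describe() table at all unless we ask for it, for example with describe(include="all").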

Correlations of Adult Data Set

Pandas Profiling

Where DataFrame.describe() outputs a single simple table, Pandas Profiling analyzes every feature in the DataFrame and produces a full report, rendered as HTML with charts, that includes the following:

  • Essentials: type, unique values, missing values
  • Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histogram
  • Correlations: highlighting of highly correlated variables, plus Spearman, Pearson and Kendall matrices
  • Missing values: matrix, count, heatmap and dendrogram of missing values
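To make the list above concrete, here is a minimal sketch of how a few of these per-feature statistics could be computed by hand with plain pandas. The sample values are hypothetical, and this is not how Pandas Profiling itself is implemented; it only shows the kind of numbers the report collects for us.

```python
import pandas as pd

# Hypothetical numeric feature, only for illustration.
s = pd.Series([17, 23, 28, 37, 37, 48, 63, 90], name="age")

# Quantile statistics
q1, median, q3 = s.quantile([0.25, 0.5, 0.75])

stats = {
    "distinct": s.nunique(),                  # essentials
    "missing": int(s.isna().sum()),
    "range": s.max() - s.min(),               # quantile statistics
    "iqr": q3 - q1,
    "mean": s.mean(),                         # descriptive statistics
    "std": s.std(),
    "coef_of_variation": s.std() / s.mean(),
    "skewness": s.skew(),
    "kurtosis": s.kurt(),
    "most_frequent": s.mode().iloc[0],        # most frequent value
}

for name, value in stats.items():
    print(f"{name}: {value}")
```

Writing this for every column of every new dataset is exactly the repetitive work the report is meant to save.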

Let's get started.


0. Magic

In [0]:
%%javascript
// Disable the scroll box on long cell outputs so the whole report is shown
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

1. Import

Install pandas_profiling if it is not installed yet.

In [0]:
#!pip install pandas_profiling
In [0]:
import pandas as pd
import pandas_profiling as pp

# fastai v1: the star imports provide untar_data and URLs used below
from fastai import *
from fastai.tabular import *

2. Data

In [0]:
# Download and extract the Adult sample dataset, then load it into a DataFrame
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')

3. Explore Data

Explore the data and look at how the features relate to one another before preparing the data to feed into a model.

In [0]:
# Build the report; leaving `profile` as the last expression renders it inline
profile = pp.ProfileReport(df)
profile
Out[0]:

Overview

Dataset info

Number of variables: 15
Number of observations: 32561
Total Missing (%): 0.2%
Total size in memory: 3.7 MiB
Average record size in memory: 120.0 B

Variables types

Numeric: 6
Categorical: 9
Boolean: 0
Date: 0
Text (Unique): 0
Rejected: 0
Unsupported: 0

Warnings

Variables

age
Numeric

Distinct count: 73
Unique (%): 0.2%
Missing (%): 0.0%
Missing (n): 0
Infinite (%): 0.0%
Infinite (n): 0
Mean: 38.582
Minimum: 17
Maximum: 90
Zeros (%): 0.0%

Quantile statistics

Minimum: 17
5-th percentile: 19
Q1: 28
Median: 37
Q3: 48
95-th percentile: 63
Maximum: 90
Range: 73
Interquartile range: 20

Descriptive statistics

Standard deviation: 13.64
Coef of variation: 0.35355
Kurtosis: -0.16613
Mean: 38.582
MAD: 11.189
Skewness: 0.55874
Sum: 1256257
Variance: 186.06
Memory size: 254.5 KiB

Value | Count | Frequency (%)
36 | 898 | 2.8%
31 | 888 | 2.7%
34 | 886 | 2.7%
23 | 877 | 2.7%
35 | 876 | 2.7%
33 | 875 | 2.7%
28 | 867 | 2.7%
30 | 861 | 2.6%
37 | 858 | 2.6%
25 | 841 | 2.6%
Other values (63) | 23834 | 73.2%

Minimum 5 values

Value | Count | Frequency (%)
17 | 395 | 1.2%
18 | 550 | 1.7%
19 | 712 | 2.2%
20 | 753 | 2.3%
21 | 720 | 2.2%

Maximum 5 values

Value | Count | Frequency (%)
85 | 3 | 0.0%
86 | 1 | 0.0%
87 | 1 | 0.0%
88 | 3 | 0.0%
90 | 43 | 0.1%

capital-gain
Numeric

Distinct count: 119
Unique (%): 0.4%
Missing (%): 0.0%
Missing (n): 0
Infinite (%): 0.0%
Infinite (n): 0
Mean: 1077.6
Minimum: 0
Maximum: 99999
Zeros (%): 91.7%

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
Median: 0
Q3: 0
95-th percentile: 5013
Maximum: 99999
Range: 99999
Interquartile range: 0

Descriptive statistics

Standard deviation: 7385.3
Coef of variation: 6.8532
Kurtosis: 154.8
Mean: 1077.6
MAD: 1977.4
Skewness: 11.954
Sum: 35089324
Variance: 54543000
Memory size: 254.5 KiB

Value | Count | Frequency (%)
0 | 29849 | 91.7%
15024 | 347 | 1.1%
7688 | 284 | 0.9%
7298 | 246 | 0.8%
99999 | 159 | 0.5%
5178 | 97 | 0.3%
3103 | 97 | 0.3%
4386 | 70 | 0.2%
5013 | 69 | 0.2%
8614 | 55 | 0.2%
Other values (109) | 1288 | 4.0%

Minimum 5 values

Value | Count | Frequency (%)
0 | 29849 | 91.7%
114 | 6 | 0.0%
401 | 2 | 0.0%
594 | 34 | 0.1%
914 | 8 | 0.0%

Maximum 5 values

Value | Count | Frequency (%)
25236 | 11 | 0.0%
27828 | 34 | 0.1%
34095 | 5 | 0.0%
41310 | 2 | 0.0%
99999 | 159 | 0.5%

capital-loss
Numeric

Distinct count: 92
Unique (%): 0.3%
Missing (%): 0.0%
Missing (n): 0
Infinite (%): 0.0%
Infinite (n): 0
Mean: 87.304
Minimum: 0
Maximum: 4356
Zeros (%): 95.3%

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
Median: 0
Q3: 0
95-th percentile: 0
Maximum: 4356
Range: 4356
Interquartile range: 0

Descriptive statistics

Standard deviation: 402.96
Coef of variation: 4.6156
Kurtosis: 20.377
Mean: 87.304
MAD: 166.46
Skewness: 4.5946
Sum: 2842700
Variance: 162380
Memory size: 254.5 KiB

Value | Count | Frequency (%)
0 | 31042 | 95.3%
1902 | 202 | 0.6%
1977 | 168 | 0.5%
1887 | 159 | 0.5%
1848 | 51 | 0.2%
1485 | 51 | 0.2%
2415 | 49 | 0.2%
1602 | 47 | 0.1%
1740 | 42 | 0.1%
1590 | 40 | 0.1%
Other values (82) | 710 | 2.2%

Minimum 5 values

Value | Count | Frequency (%)
0 | 31042 | 95.3%
155 | 1 | 0.0%
213 | 4 | 0.0%
323 | 3 | 0.0%
419 | 3 | 0.0%

Maximum 5 values

Value | Count | Frequency (%)
3004 | 2 | 0.0%
3683 | 2 | 0.0%
3770 | 2 | 0.0%
3900 | 2 | 0.0%
4356 | 3 | 0.0%

education
Categorical

Distinct count: 16
Unique (%): 0.0%
Missing (%): 0.0%
Missing (n): 0

HS-grad: 10501
Some-college: 7291
Bachelors: 5355
Other values (13): 9414

Value | Count | Frequency (%)
HS-grad | 10501 | 32.3%
Some-college | 7291 | 22.4%
Bachelors | 5355 | 16.4%
Masters | 1723 | 5.3%
Assoc-voc | 1382 | 4.2%
11th | 1175 | 3.6%
Assoc-acdm | 1067 | 3.3%
10th | 933 | 2.9%
7th-8th | 646 | 2.0%
Prof-school | 576 | 1.8%
Other values (6) | 1912 | 5.9%

education-num
Numeric

Distinct count: 17
Unique (%): 0.1%
Missing (%): 1.5%
Missing (n): 487
Infinite (%): 0.0%
Infinite (n): 0
Mean: 10.08
Minimum: 1
Maximum: 16
Zeros (%): 0.0%

Quantile statistics

Minimum: 1
5-th percentile: 5
Q1: 9
Median: 10
Q3: 12
95-th percentile: 14
Maximum: 16
Range: 15
Interquartile range: 3

Descriptive statistics

Standard deviation: 2.573
Coef of variation: 0.25526
Kurtosis: 0.62843
Mean: 10.08
MAD: 1.9024
Skewness: -0.31347
Sum: 323300
Variance: 6.6203
Memory size: 254.5 KiB

Value | Count | Frequency (%)
9.0 | 10349 | 31.8%
10.0 | 7184 | 22.1%
13.0 | 5277 | 16.2%
14.0 | 1692 | 5.2%
11.0 | 1365 | 4.2%
7.0 | 1153 | 3.5%
12.0 | 1049 | 3.2%
6.0 | 916 | 2.8%
4.0 | 640 | 2.0%
15.0 | 565 | 1.7%
Other values (6) | 1884 | 5.8%

Minimum 5 values

Value | Count | Frequency (%)
1.0 | 51 | 0.2%
2.0 | 166 | 0.5%
3.0 | 328 | 1.0%
4.0 | 640 | 2.0%
5.0 | 506 | 1.6%

Maximum 5 values

Value | Count | Frequency (%)
12.0 | 1049 | 3.2%
13.0 | 5277 | 16.2%
14.0 | 1692 | 5.2%
15.0 | 565 | 1.7%
16.0 | 408 | 1.3%

fnlwgt
Numeric

Distinct count: 21648
Unique (%): 66.5%
Missing (%): 0.0%
Missing (n): 0
Infinite (%): 0.0%
Infinite (n): 0
Mean: 189780
Minimum: 12285
Maximum: 1484705
Zeros (%): 0.0%

Quantile statistics

Minimum: 12285
5-th percentile: 39460
Q1: 117830
Median: 178360
Q3: 237050
95-th percentile: 379680
Maximum: 1484705
Range: 1472420
Interquartile range: 119220

Descriptive statistics

Standard deviation: 105550
Coef of variation: 0.55617
Kurtosis: 6.2188
Mean: 189780
MAD: 77608
Skewness: 1.447
Sum: 6179373392
Variance: 11141000000
Memory size: 254.5 KiB

Value | Count | Frequency (%)
203488 | 13 | 0.0%
123011 | 13 | 0.0%
164190 | 13 | 0.0%
113364 | 12 | 0.0%
121124 | 12 | 0.0%
148995 | 12 | 0.0%
126675 | 12 | 0.0%
111483 | 11 | 0.0%
155659 | 11 | 0.0%
190290 | 11 | 0.0%
Other values (21638) | 32441 | 99.6%

Minimum 5 values

Value | Count | Frequency (%)
12285 | 1 | 0.0%
13769 | 1 | 0.0%
14878 | 1 | 0.0%
18827 | 1 | 0.0%
19214 | 1 | 0.0%

Maximum 5 values

Value | Count | Frequency (%)
1226583 | 1 | 0.0%
1268339 | 1 | 0.0%
1366120 | 1 | 0.0%
1455435 | 1 | 0.0%
1484705 | 1 | 0.0%

hours-per-week
Numeric

Distinct count: 94
Unique (%): 0.3%
Missing (%): 0.0%
Missing (n): 0
Infinite (%): 0.0%
Infinite (n): 0
Mean: 40.437
Minimum: 1
Maximum: 99
Zeros (%): 0.0%

Quantile statistics

Minimum: 1
5-th percentile: 18
Q1: 40
Median: 40
Q3: 45
95-th percentile: 60
Maximum: 99
Range: 98
Interquartile range: 5

Descriptive statistics

Standard deviation: 12.347
Coef of variation: 0.30535
Kurtosis: 2.9167
Mean: 40.437
MAD: 7.5832
Skewness: 0.22764
Sum: 1316684
Variance: 152.46
Memory size: 254.5 KiB

Value | Count | Frequency (%)
40 | 15217 | 46.7%
50 | 2819 | 8.7%
45 | 1824 | 5.6%
60 | 1475 | 4.5%
35 | 1297 | 4.0%
20 | 1224 | 3.8%
30 | 1149 | 3.5%
55 | 694 | 2.1%
25 | 674 | 2.1%
48 | 517 | 1.6%
Other values (84) | 5671 | 17.4%

Minimum 5 values

Value | Count | Frequency (%)
1 | 20 | 0.1%
2 | 32 | 0.1%
3 | 39 | 0.1%
4 | 54 | 0.2%
5 | 60 | 0.2%

Maximum 5 values

Value | Count | Frequency (%)
95 | 2 | 0.0%
96 | 5 | 0.0%
97 | 2 | 0.0%
98 | 11 | 0.0%
99 | 85 | 0.3%

marital-status
Categorical

Distinct count: 7
Unique (%): 0.0%
Missing (%): 0.0%
Missing (n): 0

Married-civ-spouse: 14976
Never-married: 10683
Divorced: 4443
Other values (4): 2459

Value | Count | Frequency (%)
Married-civ-spouse | 14976 | 46.0%
Never-married | 10683 | 32.8%
Divorced | 4443 | 13.6%
Separated | 1025 | 3.1%
Widowed | 993 | 3.0%
Married-spouse-absent | 418 | 1.3%
Married-AF-spouse | 23 | 0.1%

native-country
Categorical

Distinct count: 42
Unique (%): 0.1%
Missing (%): 0.0%
Missing (n): 0

United-States: 29170
Mexico: 643
?: 583
Other values (39): 2165

Value | Count | Frequency (%)
United-States | 29170 | 89.6%
Mexico | 643 | 2.0%
? | 583 | 1.8%
Philippines | 198 | 0.6%
Germany | 137 | 0.4%
Canada | 121 | 0.4%
Puerto-Rico | 114 | 0.4%
El-Salvador | 106 | 0.3%
India | 100 | 0.3%
Cuba | 95 | 0.3%
Other values (32) | 1294 | 4.0%

occupation
Categorical

Distinct count: 16
Unique (%): 0.0%
Missing (%): 1.6%
Missing (n): 512

Prof-specialty: 4073
Craft-repair: 4028
Exec-managerial: 4009
Other values (12): 19939

Value | Count | Frequency (%)
Prof-specialty | 4073 | 12.5%
Craft-repair | 4028 | 12.4%
Exec-managerial | 4009 | 12.3%
Adm-clerical | 3720 | 11.4%
Sales | 3590 | 11.0%
Other-service | 3247 | 10.0%
Machine-op-inspct | 1968 | 6.0%
? | 1820 | 5.6%
Transport-moving | 1566 | 4.8%
Handlers-cleaners | 1347 | 4.1%
Other values (5) | 2681 | 8.2%

race
Categorical

Distinct count: 5
Unique (%): 0.0%
Missing (%): 0.0%
Missing (n): 0

White: 27816
Black: 3124
Asian-Pac-Islander: 1039
Other values (2): 582

Value | Count | Frequency (%)
White | 27816 | 85.4%
Black | 3124 | 9.6%
Asian-Pac-Islander | 1039 | 3.2%
Amer-Indian-Eskimo | 311 | 1.0%
Other | 271 | 0.8%

relationship
Categorical

Distinct count: 6
Unique (%): 0.0%
Missing (%): 0.0%
Missing (n): 0

Husband: 13193
Not-in-family: 8305
Own-child: 5068
Other values (3): 5995

Value | Count | Frequency (%)
Husband | 13193 | 40.5%
Not-in-family | 8305 | 25.5%
Own-child | 5068 | 15.6%
Unmarried | 3446 | 10.6%
Wife | 1568 | 4.8%
Other-relative | 981 | 3.0%

salary
Categorical

Distinct count: 2
Unique (%): 0.0%
Missing (%): 0.0%
Missing (n): 0

<50k: 24720
>=50k: 7841

Value | Count | Frequency (%)
<50k | 24720 | 75.9%
>=50k | 7841 | 24.1%

sex
Categorical

Distinct count: 2
Unique (%): 0.0%
Missing (%): 0.0%
Missing (n): 0

Male: 21790
Female: 10771

Value | Count | Frequency (%)
Male | 21790 | 66.9%
Female | 10771 | 33.1%

workclass
Categorical

Distinct count: 9
Unique (%): 0.0%
Missing (%): 0.0%
Missing (n): 0

Private: 22696
Self-emp-not-inc: 2541
Local-gov: 2093
Other values (6): 5231

Value | Count | Frequency (%)
Private | 22696 | 69.7%
Self-emp-not-inc | 2541 | 7.8%
Local-gov | 2093 | 6.4%
? | 1836 | 5.6%
State-gov | 1298 | 4.0%
Self-emp-inc | 1116 | 3.4%
Federal-gov | 960 | 2.9%
Without-pay | 14 | 0.0%
Never-worked | 7 | 0.0%

Correlations

Sample

# | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary
0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k
1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k
2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k
3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k
4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k

View the list of variables that were rejected.

In [0]:
# Variables rejected for high correlation (above the threshold) with another variable
rejected_variables = profile.get_rejected_variables(threshold=0.9)
rejected_variables
Out[0]:
[]
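The empty list means no variable in this dataset was rejected at the 0.9 threshold. To show the idea behind rejection, here is a minimal sketch that flags numeric columns whose absolute Pearson correlation with an earlier column exceeds a threshold. The helper highly_correlated and the toy data are hypothetical; this approximates the concept with plain pandas and is not pandas_profiling's own implementation.

```python
import pandas as pd

def highly_correlated(frame, threshold=0.9):
    """Return columns whose |Pearson r| with an earlier, kept column exceeds threshold."""
    corr = frame.corr().abs()
    cols = list(corr.columns)
    rejected = []
    for i, col in enumerate(cols):
        # Compare only against earlier columns that have not been rejected themselves.
        kept_earlier = [prev for prev in cols[:i] if prev not in rejected]
        if any(corr.loc[col, prev] > threshold for prev in kept_earlier):
            rejected.append(col)
    return rejected

# Hypothetical toy data: x_doubled is almost a linear copy of x.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "x_doubled": [2, 4, 6, 8, 10.5],
    "noise": [3, 1, 4, 1, 5],
})

rejected = highly_correlated(toy)
print(rejected)
```

A column flagged this way carries nearly the same information as one we already have, so it is a candidate to drop before modeling.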

3.1 Save Report

We can save the profile report to view later, or to reuse in presentations and other work.

In [0]:
# Write the full report to a standalone HTML file
profile.to_file("output.html")

Credit


Keng Surapong
Project Manager at Bua Labs
The ultimate test of your knowledge is your capacity to convey it to another.

Published by Keng Surapong
