What is sample imbalance
import pandas as pd
import numpy as np
import seaborn as sns
# Columns: 姓名 = name, 年龄 = age, 头发颜色 = hair color (白 = white, 黑 = black)
values = {"姓名":["A","B","C","D","E","F","G","H","I","J","K","L","G","H","I","J","K","L"],
"年龄":[55,70,80,90,60,30,67,44,60,30,67,44,30,67,30,67,30,67],
"头发颜色":["白","白","白","白","白","黑","白","黑","白","白","黑","白","白","黑","白","白","黑","黑"]}
table = pd.DataFrame(values)
table
|    | 姓名 | 年龄 | 头发颜色 |
|----|------|------|----------|
| 0  | A | 55 | 白 |
| 1  | B | 70 | 白 |
| 2  | C | 80 | 白 |
| 3  | D | 90 | 白 |
| 4  | E | 60 | 白 |
| 5  | F | 30 | 黑 |
| 6  | G | 67 | 白 |
| 7  | H | 44 | 黑 |
| 8  | I | 60 | 白 |
| 9  | J | 30 | 白 |
| 10 | K | 67 | 黑 |
| 11 | L | 44 | 白 |
| 12 | G | 30 | 白 |
| 13 | H | 67 | 黑 |
| 14 | I | 30 | 白 |
| 15 | J | 67 | 白 |
| 16 | K | 30 | 黑 |
| 17 | L | 67 | 黑 |
# Encode hair color as integer codes: 白 (white) -> 0, 黑 (black) -> 1
table["头发颜色"] = pd.Categorical(table["头发颜色"]).codes
table
|    | 姓名 | 年龄 | 头发颜色 |
|----|------|------|----------|
| 0  | A | 55 | 0 |
| 1  | B | 70 | 0 |
| 2  | C | 80 | 0 |
| 3  | D | 90 | 0 |
| 4  | E | 60 | 0 |
| 5  | F | 30 | 1 |
| 6  | G | 67 | 0 |
| 7  | H | 44 | 1 |
| 8  | I | 60 | 0 |
| 9  | J | 30 | 0 |
| 10 | K | 67 | 1 |
| 11 | L | 44 | 0 |
| 12 | G | 30 | 0 |
| 13 | H | 67 | 1 |
| 14 | I | 30 | 0 |
| 15 | J | 67 | 0 |
| 16 | K | 30 | 1 |
| 17 | L | 67 | 1 |
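The integer codes produced by pd.Categorical follow the sorted order of the categories, which is why 白 becomes 0 and 黑 becomes 1 here. As a small sketch (the three-row series `colors` is a made-up stand-in for the hair color column, not part of the original notebook), the mapping from code back to category can be recovered like this:

```python
import pandas as pd

# Hypothetical three-row stand-in for the hair color column.
colors = pd.Series(["白", "黑", "白"])

cat = pd.Categorical(colors)

# cat.categories holds the sorted unique values; a row's code is the
# position of its value in that list.
mapping = dict(enumerate(cat.categories))

print(mapping)
print(list(cat.codes))
```

Keeping this mapping around makes it easy to translate the encoded labels back when reading plots or model output.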
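- The histogram below shows that when hair color is used as the classification label, the samples are imbalanced
- There are 12 white-hair samples but only 6 black-hair samples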
# Histogram of the encoded labels (0 = white, 1 = black)
table["头发颜色"].plot(kind="hist")
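Besides reading the imbalance off a histogram, the class counts can be computed directly. The following is a minimal sketch; the `labels` series simply re-enters the encoded label column from the table above, and the names `counts` and `ratio` are illustrative.

```python
import pandas as pd

# Encoded label column from the table above: 0 = white hair, 1 = black hair.
labels = pd.Series(
    [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1],
    name="头发颜色",
)

counts = labels.value_counts()        # rows per class
ratio = counts.max() / counts.min()   # majority-to-minority ratio

print(counts.to_dict())
print(ratio)
```

A ratio of 1 would mean perfectly balanced classes; here the majority class is twice as large as the minority class.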

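How to balance the dataset: resampling
- The goal is for every label in the dataset to have roughly the same number of samples
- Either randomly discard part of the larger class (undersampling), or expand the smaller class (oversampling)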
Undersampling
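Split the table by label so the larger class can be selected on its own
# Majority class: white hair (code 0)
df_white = table.loc[table["头发颜色"] == 0]
# Minority class: black hair (code 1)
df_black = table.loc[table["头发颜色"] == 1]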
df_white
|    | 姓名 | 年龄 | 头发颜色 |
|----|------|------|----------|
| 0  | A | 55 | 0 |
| 1  | B | 70 | 0 |
| 2  | C | 80 | 0 |
| 3  | D | 90 | 0 |
| 4  | E | 60 | 0 |
| 6  | G | 67 | 0 |
| 8  | I | 60 | 0 |
| 9  | J | 30 | 0 |
| 11 | L | 44 | 0 |
| 12 | G | 30 | 0 |
| 14 | I | 30 | 0 |
| 15 | J | 67 | 0 |
df_black
|    | 姓名 | 年龄 | 头发颜色 |
|----|------|------|----------|
| 5  | F | 30 | 1 |
| 7  | H | 44 | 1 |
| 10 | K | 67 | 1 |
| 13 | H | 67 | 1 |
| 16 | K | 30 | 1 |
| 17 | L | 67 | 1 |
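Randomly sample a fixed number of rows from the larger class and keep only those
# Keep as many white-hair rows as there are black-hair rows (6); random_state makes the draw reproducible
df_white = df_white.sample(n=len(df_black), random_state=30)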
df_white
|    | 姓名 | 年龄 | 头发颜色 |
|----|------|------|----------|
| 0  | A | 55 | 0 |
| 8  | I | 60 | 0 |
| 12 | G | 30 | 0 |
| 11 | L | 44 | 0 |
| 1  | B | 70 | 0 |
| 3  | D | 90 | 0 |
Concatenate the result with the smaller class to form the final dataset
table_undersampling = pd.concat([df_black,df_white],axis=0,ignore_index=True)
table_undersampling
|    | 姓名 | 年龄 | 头发颜色 |
|----|------|------|----------|
| 0  | F | 30 | 1 |
| 1  | H | 44 | 1 |
| 2  | K | 67 | 1 |
| 3  | H | 67 | 1 |
| 4  | K | 30 | 1 |
| 5  | L | 67 | 1 |
| 6  | A | 55 | 0 |
| 7  | I | 60 | 0 |
| 8  | G | 30 | 0 |
| 9  | L | 44 | 0 |
| 10 | B | 70 | 0 |
| 11 | D | 90 | 0 |
The samples are now balanced: 6 rows per class
table_undersampling["头发颜色"].plot(kind="hist")
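The undersampling steps above can be wrapped into a reusable helper. This is only a sketch under some assumptions: the function name `undersample` and the frame `demo` are illustrative and not part of the original notebook, and every class is cut down to the size of the smallest class rather than to a hard-coded 6.

```python
import pandas as pd

def undersample(df, label_col, random_state=None):
    """Randomly downsample every class to the size of the smallest class."""
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    # Stack the per-class samples and renumber the rows from 0.
    return pd.concat(parts, axis=0, ignore_index=True)

# Age and encoded hair color columns re-entered from the table above.
demo = pd.DataFrame({
    "年龄": [55, 70, 80, 90, 60, 30, 67, 44, 60, 30, 67, 44, 30, 67, 30, 67, 30, 67],
    "头发颜色": [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1],
})

balanced = undersample(demo, "头发颜色", random_state=30)
print(balanced["头发颜色"].value_counts().to_dict())
```

Because the target size is derived from the data, the same helper also works when there are more than two classes. The trade-off of undersampling is that the discarded majority rows are lost to the model.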

Oversampling
Expand the smaller class with the imblearn library
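from imblearn.over_sampling import SMOTE
# 'auto' oversamples every class except the majority class
sm = SMOTE(sampling_strategy='auto', random_state=7)
# SMOTE needs numeric features, so 姓名 (name) is dropped and only 年龄 (age) is used
oversampled_data, oversampled_label = sm.fit_resample(table.drop(['姓名', '头发颜色'], axis=1), table['头发颜色'])
oversampled_table = pd.concat([oversampled_data, oversampled_label], axis=1)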
The samples are now balanced
oversampled_table["头发颜色"].plot(kind="hist")
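SMOTE creates new synthetic rows by interpolating between neighbouring minority samples. A simpler alternative, shown here only as a sketch, is random oversampling: duplicate existing minority rows, drawn with replacement, until every class matches the largest one. This is a different technique from SMOTE, and the names `random_oversample` and `demo` are illustrative rather than taken from the original notebook.

```python
import pandas as pd

def random_oversample(df, label_col, random_state=None):
    """Duplicate rows of smaller classes until all classes match the largest."""
    n_max = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        extra = n_max - len(group)
        if extra > 0:
            # Draw the missing rows from the class itself, with replacement.
            dup = group.sample(n=extra, replace=True, random_state=random_state)
            group = pd.concat([group, dup], axis=0)
        parts.append(group)
    return pd.concat(parts, axis=0, ignore_index=True)

# Age and encoded hair color columns re-entered from the table above.
demo = pd.DataFrame({
    "年龄": [55, 70, 80, 90, 60, 30, 67, 44, 60, 30, 67, 44, 30, 67, 30, 67, 30, 67],
    "头发颜色": [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1],
})

oversampled = random_oversample(demo, "头发颜色", random_state=7)
print(oversampled["头发颜色"].value_counts().to_dict())
```

Random oversampling keeps all original columns, including non-numeric ones, but the repeated rows add no new information and can encourage overfitting, which is one reason interpolating methods such as SMOTE are often tried as well.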
