3.7 Advanced Processing: Data Discretization

  • Objectives
    • Use cut and qcut to group data into intervals
    • Use get_dummies to one-hot encode data
  • Overview
    • 3.7.1 What is data discretization
    • 3.7.2 Why discretize
    • 3.7.3 How to discretize data

3.7.1 What is data discretization

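Discretizing a continuous attribute means dividing its value range into a number of discrete intervals, then using a distinct symbol or integer to represent the values that fall into each interval.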
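As a minimal sketch of this definition, the snippet below maps a few ages to the integer position of the interval each one falls into. The ages and the bin edges are illustrative assumptions, not values from the tutorial.

```python
import pandas as pd

# Hypothetical ages; the bin edges below are illustrative assumptions.
ages = pd.Series([23, 30, 18, 45, 67], name="age")

# labels=False returns the integer position of the interval each value falls into.
# The intervals are (0, 18], (18, 35], (35, 60], (60, 120].
codes = pd.cut(ages, bins=[0, 18, 35, 60, 120], labels=False)
print(codes.tolist())  # [1, 1, 0, 2, 3]
```

Because intervals are right-closed by default, the age 18 lands in the first interval rather than the second.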

Non-discretized data:

   Gender  Age
A       1   23
B       2   30
C       1   18

Non-discretized data:

Species Fur
A 1
B 2
C 3

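Discretized data

One-hot encoding / dummy variables (the Gender column is replaced by one indicator column per gender code):

   Gender=1  Gender=2  Age
A         1         0   23
B         0         1   30
C         1         0   18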
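The one-hot table can be reproduced with a short sketch. It rebuilds the gender/age data from the first table above; the variable names and the English column names are assumptions for illustration.

```python
import pandas as pd

# Data from the first table above: gender is coded as 1 or 2.
people = pd.DataFrame(
    {"gender": [1, 2, 1], "age": [23, 30, 18]},
    index=["A", "B", "C"],
)

# columns= selects which columns to encode;
# dtype=int gives 0/1 indicators rather than booleans.
encoded = pd.get_dummies(people, columns=["gender"], dtype=int)
print(encoded)
```

The untouched age column comes first, followed by one indicator column for each distinct gender code (gender_1 and gender_2).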

3.7.2 Why discretize

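Continuous attributes are discretized to simplify the structure of the data: discretization reduces the number of distinct values a continuous attribute can take. It is commonly used as a preprocessing tool in data mining.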
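A small sketch makes the reduction concrete by counting distinct values before and after binning. It uses the same height values and bin edges that appear in section 3.7.3; the variable names are assumptions.

```python
import pandas as pd

heights = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])

# Three right-closed intervals: (150, 165], (165, 180], (180, 195].
binned = pd.cut(heights, bins=[150, 165, 180, 195])

# Eight distinct raw values collapse into three distinct intervals.
print(heights.nunique(), binned.nunique())  # 8 3
```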

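3.7.3 How to discretize data

Workflow:

  1. Group the data into bins
    • Automatic binning: pd.qcut(data, q) # q is the number of quantile-based bins; returns a Series when data is a Series.
    • Custom binning: pd.cut(data, bins) # bins is a list of bin edges; returns a Series when data is a Series.
    • Binning is usually paired with value_counts
      • Series.value_counts(): counts how many values fall into each bin
  2. Compute dummy variables for the binned data
    • pd.get_dummies(data, prefix=None)
      • data: array-like, Series, or DataFrame
      • prefix: string prepended to the name of each generated column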
# 1) Prepare the data
import pandas as pd
data = pd.Series([165, 174, 160, 180, 159, 163, 192, 184], index=['NO1:165', 'NO2:174', 'NO3:160', 'NO4:180', 'NO5:159', 'NO6:163', 'NO7:192', 'NO8:184'])
data
NO1:165    165
NO2:174    174
NO3:160    160
NO4:180    180
NO5:159    159
NO6:163    163
NO7:192    192
NO8:184    184
dtype: int64
# 2) Binning
# Automatic (quantile-based) binning
sr = pd.qcut(data, 3)
sr
type(sr)
NO1:165      (163.667, 178.0]
NO2:174      (163.667, 178.0]
NO3:160    (158.999, 163.667]
NO4:180        (178.0, 192.0]
NO5:159    (158.999, 163.667]
NO6:163    (158.999, 163.667]
NO7:192        (178.0, 192.0]
NO8:184        (178.0, 192.0]
dtype: category
Categories (3, interval[float64]): [(158.999, 163.667] < (163.667, 178.0] < (178.0, 192.0]]

pandas.core.series.Series
# Count how many values fall into each bin
sr.value_counts()
(178.0, 192.0]        3
(158.999, 163.667]    3
(163.667, 178.0]      2
dtype: int64
# 3) Convert to dummy variables
pd.get_dummies(sr, prefix='height')
         height_(158.999, 163.667]  height_(163.667, 178.0]  height_(178.0, 192.0]
NO1:165                          0                        1                      0
NO2:174                          0                        1                      0
NO3:160                          1                        0                      0
NO4:180                          0                        0                      1
NO5:159                          1                        0                      0
NO6:163                          1                        0                      0
NO7:192                          0                        0                      1
NO8:184                          0                        0                      1
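The 0/1 output shown here comes from an older pandas release. Recent releases (2.0 and later) return boolean columns from get_dummies by default, so a sketch that asks for integers explicitly may help when reproducing the table. The variable name dummies is an assumption.

```python
import pandas as pd

data = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])
sr = pd.qcut(data, 3)

# dtype=int requests integer 0/1 indicators regardless of the library default.
dummies = pd.get_dummies(sr, prefix="height", dtype=int)

# Every value belongs to exactly one bin, so each row contains a single 1.
print(dummies.sum(axis=1).tolist())
```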
# 2) Binning
# Custom binning with explicit bin edges
bins = [150, 165, 180, 195]
sr2 = pd.cut(data, bins)
sr2
NO1:165    (150, 165]
NO2:174    (165, 180]
NO3:160    (150, 165]
NO4:180    (165, 180]
NO5:159    (150, 165]
NO6:163    (150, 165]
NO7:192    (180, 195]
NO8:184    (180, 195]
dtype: category
Categories (3, interval[int64]): [(150, 165] < (165, 180] < (180, 195]]
sr2.value_counts()
(150, 165]    4
(180, 195]    2
(165, 180]    2
dtype: int64
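value_counts orders its result by count, which is why the bins above do not appear in ascending order. A minimal sketch, assuming the same data and bin edges, restores the natural bin order with sort_index; the variable name counts is an assumption.

```python
import pandas as pd

data = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])
sr2 = pd.cut(data, [150, 165, 180, 195])

# value_counts() sorts by count; sort_index() restores the natural bin order.
counts = sr2.value_counts().sort_index()
print(counts)
```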
pd.get_dummies(sr2, prefix='身高')
         身高_(150, 165]  身高_(165, 180]  身高_(180, 195]
NO1:165              1              0              0
NO2:174              0              1              0
NO3:160              1              0              0
NO4:180              0              1              0
NO5:159              1              0              0
NO6:163              1              0              0
NO7:192              0              0              1
NO8:184              0              0              1
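pd.cut also accepts two optional parameters that are often useful with custom bins: labels, which names each interval, and right, which controls which side of the interval is closed. The sketch below assumes the same data and bin edges; the label names are hypothetical.

```python
import pandas as pd

data = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])
bins = [150, 165, 180, 195]

# The label names are hypothetical; there must be one label per interval.
named = pd.cut(data, bins, labels=["short", "medium", "tall"])

# right=False makes the intervals left-closed: [150, 165), [165, 180), [180, 195).
left_closed = pd.cut(data, bins, right=False)

print(named.tolist())
print(left_closed)
```

With right=False the boundary values 165 and 180 move up into the next interval, so the choice matters whenever data sits exactly on a bin edge.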

Case study: discretizing stock price changes

# 1) Read the data
import pandas as pd
stock = pd.read_excel('stock.xls')
stock
      trade_date      close       open       high        low  pre_close    change  pct_chg          vol        amount
0       20200313  2887.4265  2804.2322  2910.8812  2799.9841  2923.4856  -36.0591  -1.2334  366450436.0  3.930197e+08
1       20200312  2923.4856  2936.0163  2944.4651  2906.2838  2968.5174  -45.0318  -1.5170  307778457.0  3.282092e+08
2       20200311  2968.5174  3001.7616  3010.0286  2968.5174  2996.7618  -28.2444  -0.9425  352470970.0  3.787666e+08
3       20200310  2996.7618  2918.9347  3000.2963  2904.7989  2943.2907   53.4711   1.8167  393296648.0  4.250172e+08
4       20200309  2943.2907  2987.1805  2989.2051  2940.7138  3034.5113  -91.2206  -3.0061  414560736.0  4.381439e+08
...          ...        ...        ...        ...        ...        ...       ...      ...          ...           ...
6997    19910719   136.7000   137.6600   138.5400   136.6600   137.1700   -0.4700  -0.3426      10823.0  5.242826e+03
6998    19910718   137.1700   137.1700   137.1700   135.8100   135.8100    1.3600   1.0014        847.0  4.644160e+02
6999    19910717   135.8100   135.8100   135.8100   135.3900   134.4700    1.3400   0.9965        660.0  3.975240e+02
7000    19910716   134.4700   134.3900   134.4700   133.1400   133.1400    1.3300   0.9989       2796.0  1.328502e+03
7001    19910715   133.1400   133.9000   134.1000   131.8700   132.8000    0.3400   0.2560      11938.0  5.534900e+03

7002 rows × 10 columns

change = stock['change']
change
0      -36.0591
1      -45.0318
2      -28.2444
3       53.4711
4      -91.2206
         ...   
6997    -0.4700
6998     1.3600
6999     1.3400
7000     1.3300
7001     0.3400
Name: change, Length: 7002, dtype: float64
# 2) Binning
# Automatic (quantile-based) binning
sr3 = pd.qcut(change, 10)
sr3
0       (-354.685, -33.319]
1       (-354.685, -33.319]
2         (-33.319, -17.08]
3           (37.551, 649.5]
4       (-354.685, -33.319]
               ...         
6997        (-3.416, 0.934]
6998          (0.934, 4.84]
6999          (0.934, 4.84]
7000          (0.934, 4.84]
7001        (-3.416, 0.934]
Name: change, Length: 7002, dtype: category
Categories (10, interval[float64]): [(-354.685, -33.319] < (-33.319, -17.08] < (-17.08, -9.298] < (-9.298, -3.416] ... (4.84, 10.614] < (10.614, 19.612] < (19.612, 37.551] < (37.551, 649.5]]
sr3.value_counts()
(37.551, 649.5]        701
(0.934, 4.84]          701
(-354.685, -33.319]    701
(19.612, 37.551]       700
(10.614, 19.612]       700
(-3.416, 0.934]        700
(-9.298, -3.416]       700
(-17.08, -9.298]       700
(-33.319, -17.08]      700
(4.84, 10.614]         699
Name: change, dtype: int64
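The nearly equal counts above follow from qcut placing its bin edges at quantiles of the data. The sketch below illustrates this on synthetic data, since stock.xls is not assumed to be available; the random values are a stand-in for the change column, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the `change` column (not the real stock data).
rng = np.random.default_rng(0)
change = pd.Series(rng.normal(0, 20, size=1000), name="change")

# retbins=True also returns the quantile edges that qcut computed.
binned, edges = pd.qcut(change, 10, retbins=True)

counts = binned.value_counts().sort_index()
print(counts.tolist())      # roughly equal counts per bin
print(np.round(edges, 3))   # 11 edges delimit 10 bins
```

By contrast, pd.cut with fixed edges can produce very uneven bins when the data is concentrated in a narrow range.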
# 3) Discretize (compute dummy variables / one-hot encoding)
stock_change = pd.get_dummies(sr3, prefix='涨跌幅')
stock_change
涨跌幅_(-354.685, -33.319] 涨跌幅_(-33.319, -17.08] 涨跌幅_(-17.08, -9.298] 涨跌幅_(-9.298, -3.416] 涨跌幅_(-3.416, 0.934] 涨跌幅_(0.934, 4.84] 涨跌幅_(4.84, 10.614] 涨跌幅_(10.614, 19.612] 涨跌幅_(19.612, 37.551] 涨跌幅_(37.551, 649.5]
0 1 0 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0 0 0
2 0 1 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 1
4 1 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ...
6997 0 0 0 0 1 0 0 0 0 0
6998 0 0 0 0 0 1 0 0 0 0
6999 0 0 0 0 0 1 0 0 0 0
7000 0 0 0 0 0 1 0 0 0 0
7001 0 0 0 0 1 0 0 0 0 0

7002 rows × 10 columns

# Custom binning with explicit bin edges
bins = [-600, -300, 0, 300, 600, 900]
sr = pd.cut(change, bins)
sr
0       (-300, 0]
1       (-300, 0]
2       (-300, 0]
3        (0, 300]
4       (-300, 0]
          ...    
6997    (-300, 0]
6998     (0, 300]
6999     (0, 300]
7000     (0, 300]
7001     (0, 300]
Name: change, Length: 7002, dtype: category
Categories (5, interval[int64]): [(-600, -300] < (-300, 0] < (0, 300] < (300, 600] < (600, 900]]
sr.value_counts()
(0, 300]        3702
(-300, 0]       3290
(-600, -300]       7
(300, 600]         2
(600, 900]         1
Name: change, dtype: int64
stock_change = pd.get_dummies(sr, prefix='涨跌幅') # one-hot / dummy variables
stock_change
      涨跌幅_(-600, -300]  涨跌幅_(-300, 0]  涨跌幅_(0, 300]  涨跌幅_(300, 600]  涨跌幅_(600, 900]
0                    0              1             0               0               0
1                    0              1             0               0               0
2                    0              1             0               0               0
3                    0              0             1               0               0
4                    0              1             0               0               0
...                ...            ...           ...             ...             ...
6997                 0              1             0               0               0
6998                 0              0             1               0               0
6999                 0              0             1               0               0
7000                 0              0             1               0               0
7001                 0              0             1               0               0

7002 rows × 5 columns
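The dummy columns are usually placed back beside the original data before further analysis. The sketch below is one possible way to do that with pd.concat. It uses rows 0, 1 and 3 of the table above as a small stand-in for the full stock frame, and the prefix "change" and the name combined are assumptions.

```python
import pandas as pd

# Rows 0, 1 and 3 of the stock table above, as a small stand-in for the full file.
stock = pd.DataFrame({
    "trade_date": [20200313, 20200312, 20200310],
    "change": [-36.0591, -45.0318, 53.4711],
})

binned = pd.cut(stock["change"], [-600, -300, 0, 300, 600, 900])
dummies = pd.get_dummies(binned, prefix="change", dtype=int)

# axis=1 places the indicator columns beside the original ones, aligned on the index.
combined = pd.concat([stock, dummies], axis=1)
print(combined.shape)  # (3, 7)
```

All five bins get a column even though only two of them are occupied here, because the binned Series is categorical and keeps every interval as a category.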
