数据清洗之 数据离散化
数据清洗数据离散化
·
数据离散化
- 数据离散化就是分箱
- 一把你常用分箱方法是等频分箱或者等宽分箱
- 一般使用pd.cut或者pd.qcut函数
pandas.cut(x, bins, right=True, labels)
- x: 数据
- bins: 离散化的数目,或者切分的区间
- labels: 离散化后各个类别的标签
- right: 是否包含区间右边的值
import pandas as pd
import numpy as np
import os
os.getcwd()
'D:\\Jupyter\\notebook\\Python数据清洗实战\\数据'
os.chdir('D:\\Jupyter\\notebook\\Python数据清洗实战\\数据')
df = pd.read_csv('MotorcycleData.csv', encoding='gbk', na_values='Na')
def f(x):
if '$' in str(x):
x = str(x).strip('$')
x = str(x).replace(',', '')
else:
x = str(x).replace(',', '')
return float(x)
df['Price'] = df['Price'].apply(f)
df['Mileage'] = df['Mileage'].apply(f)
df.head(5)
| Condition | Condition_Desc | Price | Location | Model_Year | Mileage | Exterior_Color | Make | Warranty | Model | ... | Vehicle_Title | OBO | Feedback_Perc | Watch_Count | N_Reviews | Seller_Status | Vehicle_Tile | Auction | Buy_Now | Bid_Count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Used | mint!!! very low miles | 11412.0 | McHenry, Illinois, United States | 2013.0 | 16000.0 | Black | Harley-Davidson | Unspecified | Touring | ... | NaN | FALSE | 8.1 | NaN | 2427 | Private Seller | Clear | True | FALSE | 28.0 |
| 1 | Used | Perfect condition | 17200.0 | Fort Recovery, Ohio, United States | 2016.0 | 60.0 | Black | Harley-Davidson | Vehicle has an existing warranty | Touring | ... | NaN | FALSE | 100 | 17 | 657 | Private Seller | Clear | True | TRUE | 0.0 |
| 2 | Used | NaN | 3872.0 | Chicago, Illinois, United States | 1970.0 | 25763.0 | Silver/Blue | BMW | Vehicle does NOT have an existing warranty | R-Series | ... | NaN | FALSE | 100 | NaN | 136 | NaN | Clear | True | FALSE | 26.0 |
| 3 | Used | CLEAN TITLE READY TO RIDE HOME | 6575.0 | Green Bay, Wisconsin, United States | 2009.0 | 33142.0 | Red | Harley-Davidson | NaN | Touring | ... | NaN | FALSE | 100 | NaN | 2920 | Dealer | Clear | True | FALSE | 11.0 |
| 4 | Used | NaN | 10000.0 | West Bend, Wisconsin, United States | 2012.0 | 17800.0 | Blue | Harley-Davidson | NO WARRANTY | Touring | ... | NaN | FALSE | 100 | 13 | 271 | OWNER | Clear | True | TRUE | 0.0 |
5 rows × 22 columns
df['Price_bin'] = pd.cut(df['Price'], 5, labels=range(5))
# 计算频数
df['Price_bin'].value_counts()
0 6762
1 659
2 50
3 20
4 2
Name: Price_bin, dtype: int64
%matplotlib inline
df['Price_bin'].value_counts().plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x1b35fba9048>
![[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-olyuNEbB-1587367665199)(output_12_1.png)]](https://i-blog.csdnimg.cn/blog_migrate/0a4aeabe29e29e8763365fae445840a1.png#pic_center)
df['Price_bin'].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x1b35f681278>
![[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-kUCpxNzE-1587367665204)(output_13_1.png)]](https://i-blog.csdnimg.cn/blog_migrate/d0bd0836d082432a73be02bc7f4fff09.png#pic_center)
w = [100, 1000, 5000, 10000, 20000, 100000]
df['Price_bin'] = pd.cut(df['Price'], bins=w, labels=range(5))
df[['Price', 'Price_bin']].head(5)
| Price | Price_bin | |
|---|---|---|
| 0 | 11412.0 | 3 |
| 1 | 17200.0 | 3 |
| 2 | 3872.0 | 1 |
| 3 | 6575.0 | 2 |
| 4 | 10000.0 | 2 |
df['Price_bin'].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x1b35fb99898>
![[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-W11kWf50-1587367665206)(output_17_1.png)]](https://i-blog.csdnimg.cn/blog_migrate/462a30d73e6a5844eb324953248907da.png#pic_center)
# 分位数
k = 5
w = [1.0 * i/k for i in range(k+1)]
w
[0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
# 等频分成5段
df['Price_bin'] = pd.qcut(df['Price'], q=w, labels=range(5))
df['Price_bin'].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x1b35fe2a080>
![[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-B3njTZxo-1587367665209)(output_21_1.png)]](https://i-blog.csdnimg.cn/blog_migrate/c8507ff27d5de8afafb5f8dd1235d324.png#pic_center)
# 计算分位点
k = 5
w1 = df['Price'].quantile([1.0 * i/k for i in range(k+1)])
w1
0.0 0.0
0.2 3500.0
0.4 6491.0
0.6 9777.0
0.8 14999.0
1.0 100000.0
Name: Price, dtype: float64
# 一般第一个分位点要比实际小
# 最后一个分位点要比实际大
w1[0] = w[0] * 0.95
w1[1.0] = w1[1.0] * 1.1
w1
0.0 0.0
0.2 3500.0
0.4 6491.0
0.6 9777.0
0.8 14999.0
1.0 110000.0
Name: Price, dtype: float64
# 按照新的分段标准分割
df['Price_bin'] = pd.cut(df['Price'], bins=w1, labels=range(5))
df['Price_bin'].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x1b35e53fa20>
![[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-CbT03bmk-1587367665212)(output_27_1.png)]](https://i-blog.csdnimg.cn/blog_migrate/5be2e6a6961c2461585a45c96a7799c3.png#pic_center)
魔乐社区(Modelers.cn) 是一个中立、公益的人工智能社区,提供人工智能工具、模型、数据的托管、展示与应用协同服务,为人工智能开发及爱好者搭建开放的学习交流平台。社区通过理事会方式运作,由全产业链共同建设、共同运营、共同享有,推动国产AI生态繁荣发展。
更多推荐


所有评论(0)