3.9 高级处理——交叉表与透视表

  • 3.9.1 交叉表与透视表有什么作用
  • 3.9.2 使用crosstab(交叉表)实现
  • 3.9.3 使用pivot_table(透视表)实现

3.9.1 交叉表与透视表的作用

  • 探究两个变量之间的关系

3.9.2 使用crosstab(交叉表)实现

  • 交叉表:交叉表用于计算一列数据对于另一列数据的分组个数(寻找两个列之间的关系)
    • pd.crosstab(value1, value2)

1.准备数据
  • 星期数据以及涨跌幅是好是坏数据
    • pd.crosstab(星期数据列, 涨跌幅数据列)
import pandas as pd
stock = pd.read_excel('stocks.xlsx', index_col=0)
stock
close open high low pre_close change pct_chg vol amount
trade_date
2020-03-13 2887.4265 2804.2322 2910.8812 2799.9841 2923.4856 -36.0591 -1.2334 366450436 393019665.2
2020-03-14 2923.4856 2936.0163 2944.4651 2906.2838 2968.5174 -45.0318 -1.5170 307778457 328209202.4
2020-03-15 2968.5174 3001.7616 3010.0286 2968.5174 2996.7618 -28.2444 -0.9425 352470970 378766619.0
2020-03-16 2996.7618 2918.9347 3000.2963 2904.7989 2943.2907 53.4711 1.8167 393296648 425017184.8
2020-03-17 2943.2907 2987.1805 2989.2051 2940.7138 3034.5113 -91.2206 -3.0061 414560736 438143854.6
2020-03-18 3034.5113 3039.9395 3052.4439 3029.4632 3071.6771 -37.1658 -1.2100 362061533 377388542.7
2020-03-19 3071.6771 3036.1545 3074.2571 3022.9262 3011.6657 60.0114 1.9926 445425806 482770471.4
2020-03-20 3011.6657 2981.8060 3012.0035 2974.3583 2992.8968 18.7689 0.6271 353338278 389893917.5
2020-03-21 2992.8968 3006.8888 3026.8420 2976.6230 2970.9312 21.9656 0.7394 410108047 447053681.5
2020-03-22 2970.9312 2899.3100 2982.5068 2899.3100 2880.3038 90.6274 3.1465 367333369 397244201.2
2020-03-23 2880.3038 2924.6407 2948.1261 2878.5443 2991.3288 -111.0250 -3.7116 401216914 432657775.0
2020-03-24 2991.3288 2992.4919 3009.4575 2980.4774 2987.9287 3.4001 0.1138 350523658 395955641.5
2020-03-25 2987.9287 2978.4195 3028.7788 2974.9423 3013.0501 -25.1214 -0.8338 469049552 495341447.3
2020-03-26 3013.0501 2982.0696 3016.9458 2943.7168 3031.2333 -18.1832 -0.5999 441622762 513128644.6
2020-03-27 3031.2333 3027.8925 3042.1821 3007.3557 3039.6692 -8.4359 -0.2775 370430044 451601363.1
2020-03-28 3039.6692 3022.2455 3058.8980 3020.1410 3030.1542 9.5150 0.3140 364557276 445062076.7
2020-03-29 3030.1542 2981.8802 3031.3706 2968.4451 2975.4019 54.7523 1.8402 345732881 413761364.1
2020-03-30 2975.4019 2979.5223 2998.2718 2971.8219 2984.9716 -9.5697 -0.3206 315141151 381331160.4
2020-03-31 2984.9716 2981.4097 2990.6003 2960.7751 2983.6224 1.3492 0.0452 311665913 374998562.6
2020-04-01 2983.6224 2924.9913 2983.6371 2924.9913 2917.0077 66.6147 2.2837 313198007 367014340.1
stock.index
DatetimeIndex(['2020-03-13', '2020-03-14', '2020-03-15', '2020-03-16',
               '2020-03-17', '2020-03-18', '2020-03-19', '2020-03-20',
               '2020-03-21', '2020-03-22', '2020-03-23', '2020-03-24',
               '2020-03-25', '2020-03-26', '2020-03-27', '2020-03-28',
               '2020-03-29', '2020-03-30', '2020-03-31', '2020-04-01'],
              dtype='datetime64[ns]', name='trade_date', freq=None)
pandas日期类型

pd.to_datetime()

# 将标准格式的输入参数转化为pandas的日期类型
date = pd.to_datetime(stock.index)
date # 0表示星期一
DatetimeIndex(['2020-03-13', '2020-03-14', '2020-03-15', '2020-03-16',
               '2020-03-17', '2020-03-18', '2020-03-19', '2020-03-20',
               '2020-03-21', '2020-03-22', '2020-03-23', '2020-03-24',
               '2020-03-25', '2020-03-26', '2020-03-27', '2020-03-28',
               '2020-03-29', '2020-03-30', '2020-03-31', '2020-04-01'],
              dtype='datetime64[ns]', name='trade_date', freq=None)
date.year
Int64Index([2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020,
            2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020],
           dtype='int64', name='trade_date')
date.weekday
Int64Index([4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2], dtype='int64', name='trade_date')
# 准备星期数据列
stock['weekday'] = date.weekday
stock
close open high low pre_close change pct_chg vol amount weekday
trade_date
2020-03-13 2887.4265 2804.2322 2910.8812 2799.9841 2923.4856 -36.0591 -1.2334 366450436 393019665.2 4
2020-03-14 2923.4856 2936.0163 2944.4651 2906.2838 2968.5174 -45.0318 -1.5170 307778457 328209202.4 5
2020-03-15 2968.5174 3001.7616 3010.0286 2968.5174 2996.7618 -28.2444 -0.9425 352470970 378766619.0 6
2020-03-16 2996.7618 2918.9347 3000.2963 2904.7989 2943.2907 53.4711 1.8167 393296648 425017184.8 0
2020-03-17 2943.2907 2987.1805 2989.2051 2940.7138 3034.5113 -91.2206 -3.0061 414560736 438143854.6 1
2020-03-18 3034.5113 3039.9395 3052.4439 3029.4632 3071.6771 -37.1658 -1.2100 362061533 377388542.7 2
2020-03-19 3071.6771 3036.1545 3074.2571 3022.9262 3011.6657 60.0114 1.9926 445425806 482770471.4 3
2020-03-20 3011.6657 2981.8060 3012.0035 2974.3583 2992.8968 18.7689 0.6271 353338278 389893917.5 4
2020-03-21 2992.8968 3006.8888 3026.8420 2976.6230 2970.9312 21.9656 0.7394 410108047 447053681.5 5
2020-03-22 2970.9312 2899.3100 2982.5068 2899.3100 2880.3038 90.6274 3.1465 367333369 397244201.2 6
2020-03-23 2880.3038 2924.6407 2948.1261 2878.5443 2991.3288 -111.0250 -3.7116 401216914 432657775.0 0
2020-03-24 2991.3288 2992.4919 3009.4575 2980.4774 2987.9287 3.4001 0.1138 350523658 395955641.5 1
2020-03-25 2987.9287 2978.4195 3028.7788 2974.9423 3013.0501 -25.1214 -0.8338 469049552 495341447.3 2
2020-03-26 3013.0501 2982.0696 3016.9458 2943.7168 3031.2333 -18.1832 -0.5999 441622762 513128644.6 3
2020-03-27 3031.2333 3027.8925 3042.1821 3007.3557 3039.6692 -8.4359 -0.2775 370430044 451601363.1 4
2020-03-28 3039.6692 3022.2455 3058.8980 3020.1410 3030.1542 9.5150 0.3140 364557276 445062076.7 5
2020-03-29 3030.1542 2981.8802 3031.3706 2968.4451 2975.4019 54.7523 1.8402 345732881 413761364.1 6
2020-03-30 2975.4019 2979.5223 2998.2718 2971.8219 2984.9716 -9.5697 -0.3206 315141151 381331160.4 0
2020-03-31 2984.9716 2981.4097 2990.6003 2960.7751 2983.6224 1.3492 0.0452 311665913 374998562.6 1
2020-04-01 2983.6224 2924.9913 2983.6371 2924.9913 2917.0077 66.6147 2.2837 313198007 367014340.1 2
# 准备涨跌幅数据列
import numpy as np
stock['pon'] = np.where(stock['change'] > 0, 1, 0)
stock
close open high low pre_close change pct_chg vol amount weekday pon
trade_date
2020-03-13 2887.4265 2804.2322 2910.8812 2799.9841 2923.4856 -36.0591 -1.2334 366450436 393019665.2 4 0
2020-03-14 2923.4856 2936.0163 2944.4651 2906.2838 2968.5174 -45.0318 -1.5170 307778457 328209202.4 5 0
2020-03-15 2968.5174 3001.7616 3010.0286 2968.5174 2996.7618 -28.2444 -0.9425 352470970 378766619.0 6 0
2020-03-16 2996.7618 2918.9347 3000.2963 2904.7989 2943.2907 53.4711 1.8167 393296648 425017184.8 0 1
2020-03-17 2943.2907 2987.1805 2989.2051 2940.7138 3034.5113 -91.2206 -3.0061 414560736 438143854.6 1 0
2020-03-18 3034.5113 3039.9395 3052.4439 3029.4632 3071.6771 -37.1658 -1.2100 362061533 377388542.7 2 0
2020-03-19 3071.6771 3036.1545 3074.2571 3022.9262 3011.6657 60.0114 1.9926 445425806 482770471.4 3 1
2020-03-20 3011.6657 2981.8060 3012.0035 2974.3583 2992.8968 18.7689 0.6271 353338278 389893917.5 4 1
2020-03-21 2992.8968 3006.8888 3026.8420 2976.6230 2970.9312 21.9656 0.7394 410108047 447053681.5 5 1
2020-03-22 2970.9312 2899.3100 2982.5068 2899.3100 2880.3038 90.6274 3.1465 367333369 397244201.2 6 1
2020-03-23 2880.3038 2924.6407 2948.1261 2878.5443 2991.3288 -111.0250 -3.7116 401216914 432657775.0 0 0
2020-03-24 2991.3288 2992.4919 3009.4575 2980.4774 2987.9287 3.4001 0.1138 350523658 395955641.5 1 1
2020-03-25 2987.9287 2978.4195 3028.7788 2974.9423 3013.0501 -25.1214 -0.8338 469049552 495341447.3 2 0
2020-03-26 3013.0501 2982.0696 3016.9458 2943.7168 3031.2333 -18.1832 -0.5999 441622762 513128644.6 3 0
2020-03-27 3031.2333 3027.8925 3042.1821 3007.3557 3039.6692 -8.4359 -0.2775 370430044 451601363.1 4 0
2020-03-28 3039.6692 3022.2455 3058.8980 3020.1410 3030.1542 9.5150 0.3140 364557276 445062076.7 5 1
2020-03-29 3030.1542 2981.8802 3031.3706 2968.4451 2975.4019 54.7523 1.8402 345732881 413761364.1 6 1
2020-03-30 2975.4019 2979.5223 2998.2718 2971.8219 2984.9716 -9.5697 -0.3206 315141151 381331160.4 0 0
2020-03-31 2984.9716 2981.4097 2990.6003 2960.7751 2983.6224 1.3492 0.0452 311665913 374998562.6 1 1
2020-04-01 2983.6224 2924.9913 2983.6371 2924.9913 2917.0077 66.6147 2.2837 313198007 367014340.1 2 1
# 产生交叉表
data = pd.crosstab(stock['weekday'], stock['pon'])
# 反映了星期几涨的有多少,跌的有多少
data
pon 0 1
weekday
0 2 1
1 1 2
2 2 1
3 1 1
4 2 1
5 1 2
6 1 2
# 计算得到百分比
data.sum(axis=1)
weekday
0    3
1    3
2    3
3    2
4    3
5    3
6    3
dtype: int64
# 除数按列方向逐个除,对应不上除数的被除数为NaN被除,得到NaN
data.div(data.sum(axis=1)) # 用data除以data.sum(axis=1)
0 1 2 3 4 5 6
weekday
0 0.666667 0.333333 NaN NaN NaN NaN NaN
1 0.333333 0.666667 NaN NaN NaN NaN NaN
2 0.666667 0.333333 NaN NaN NaN NaN NaN
3 0.333333 0.333333 NaN NaN NaN NaN NaN
4 0.666667 0.333333 NaN NaN NaN NaN NaN
5 0.333333 0.666667 NaN NaN NaN NaN NaN
6 0.333333 0.666667 NaN NaN NaN NaN NaN
# 除数按列方向逐个除,对应不上除数的被除数为NaN被除,得到NaN
data / data.sum(axis=1)
0 1 2 3 4 5 6
weekday
0 0.666667 0.333333 NaN NaN NaN NaN NaN
1 0.333333 0.666667 NaN NaN NaN NaN NaN
2 0.666667 0.333333 NaN NaN NaN NaN NaN
3 0.333333 0.333333 NaN NaN NaN NaN NaN
4 0.666667 0.333333 NaN NaN NaN NaN NaN
5 0.333333 0.666667 NaN NaN NaN NaN NaN
6 0.333333 0.666667 NaN NaN NaN NaN NaN
# 按行方向一行一行地除
data.div(data.sum(axis=1), axis=0) # 唯一正确解
pon 0 1
weekday
0 0.666667 0.333333
1 0.333333 0.666667
2 0.666667 0.333333
3 0.500000 0.500000
4 0.666667 0.333333
5 0.333333 0.666667
6 0.333333 0.666667
# 除数按列方向逐个除,对应不上除数的被除数为NaN被除,得到NaN
data.div(data.sum(axis=1), axis=1)
0 1 2 3 4 5 6
weekday
0 0.666667 0.333333 NaN NaN NaN NaN NaN
1 0.333333 0.666667 NaN NaN NaN NaN NaN
2 0.666667 0.333333 NaN NaN NaN NaN NaN
3 0.333333 0.333333 NaN NaN NaN NaN NaN
4 0.666667 0.333333 NaN NaN NaN NaN NaN
5 0.333333 0.666667 NaN NaN NaN NaN NaN
6 0.333333 0.666667 NaN NaN NaN NaN NaN
# 画图显示
data.div(data.sum(axis=1), axis=0).plot(kind='bar')
data.div(data.sum(axis=1), axis=0).plot(kind='bar', stacked=True)

在这里插入图片描述

3.9.3 使用pivot_table(透视表)实现

  • DataFrame.pivot_table([], index=[])

使用透视表,刚才的过程更加简单

# 使用透视表,将整个过程变得更简单一些
stock.pivot_table(['pon'], index=['weekday']) # 得到的是['pon']列的‘1’对应的百分比
pon
weekday
0 0.333333
1 0.666667
2 0.333333
3 0.500000
4 0.333333
5 0.666667
6 0.666667
Logo

魔乐社区(Modelers.cn) 是一个中立、公益的人工智能社区,提供人工智能工具、模型、数据的托管、展示与应用协同服务,为人工智能开发及爱好者搭建开放的学习交流平台。社区通过理事会方式运作,由全产业链共同建设、共同运营、共同享有,推动国产AI生态繁荣发展。

更多推荐