Case Study 1

Goal: reduce the dimensionality of users' preferences across product categories (aisles).

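Data:

  • products.csv: product information
  • order_products__prior.csv: which products appear in which orders
  • orders.csv: which user placed each order
  • aisles.csv: the specific category (aisle) each product belongs to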

Steps

  1. Merge the four tables into a single table: pd.merge()
  2. Build a table with one row per user and one column per aisle
  3. Apply PCA

Step 1

import pandas as pd
from sklearn.decomposition import PCA

# Read the four tables
prior = pd.read_csv("order_products__prior.csv")
products = pd.read_csv("products.csv")
orders = pd.read_csv("orders.csv")
aisles = pd.read_csv("aisles.csv")

# Merge the four tables into one, joining on the shared key of each pair
_mg = pd.merge(prior, products, on="product_id")
_mg = pd.merge(_mg, orders, on="order_id")
mt = pd.merge(_mg, aisles, on="aisle_id")

print(mt.head())

Output:
0         2       33120  ...                     8.0   eggs
1        26       33120  ...                     7.0   eggs
2       120       33120  ...                    10.0   eggs
3       327       33120  ...                     8.0   eggs
4       390       33120  ...                     9.0   eggs
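
The chain of joins can be checked on a few hand-made rows before running it on the full files. The sketch below uses made-up toy tables, and it assumes each CSV carries only the key columns the merges rely on (product_id, order_id, aisle_id, user_id, aisle). The validate argument is an optional safeguard, not something the original script uses.

```python
import pandas as pd

# Made-up toy tables that mimic the assumed column layout of the four CSV files
prior = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [10, 11, 10]})
products = pd.DataFrame({"product_id": [10, 11], "aisle_id": [100, 101]})
orders = pd.DataFrame({"order_id": [1, 2], "user_id": [7, 8]})
aisles = pd.DataFrame({"aisle_id": [100, 101], "aisle": ["eggs", "yogurt"]})

# Chain the inner joins. validate="many_to_one" raises a MergeError if a key
# is duplicated in the right-hand table, which would silently multiply rows.
merged = (
    prior.merge(products, on="product_id", validate="many_to_one")
         .merge(orders, on="order_id", validate="many_to_one")
         .merge(aisles, on="aisle_id", validate="many_to_one")
)

# Each purchased item is now labeled with its user and its aisle
print(merged[["user_id", "aisle"]])
```

Because every join here is many-to-one, the merged table keeps exactly one row per line of the order-products table.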

Step 2

import pandas as pd
from sklearn.decomposition import PCA

# Read the four tables
prior = pd.read_csv("order_products__prior.csv")
products = pd.read_csv("products.csv")
orders = pd.read_csv("orders.csv")
aisles = pd.read_csv("aisles.csv")

# Merge the four tables into one, joining on the shared key of each pair
_mg = pd.merge(prior, products, on="product_id")
_mg = pd.merge(_mg, orders, on="order_id")
mt = pd.merge(_mg, aisles, on="aisle_id")

# Cross-tabulation (a special grouping tool): count each user's purchases per aisle
cross = pd.crosstab(mt["user_id"], mt["aisle"])

# Print the first 5 rows
print(cross.head())

Output:
aisle    air fresheners candles  asian foods  ...  white wines  yogurt
user_id                                       ...                     
1                             0            0  ...            0       1
2                             0            3  ...            0      42
3                             0            0  ...            0       0
4                             0            0  ...            0       0
5                             0            2  ...            0       3
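
To see what the cross-tabulation computes, it helps to run it on a handful of rows. The sketch below uses a made-up five-row table (mt_small is a hypothetical name) and compares pd.crosstab with the equivalent groupby formulation; both count how many times each user bought from each aisle.

```python
import pandas as pd

# Made-up purchases: user 1 bought eggs twice and yogurt once, user 2 bought yogurt twice
mt_small = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "aisle": ["eggs", "eggs", "yogurt", "yogurt", "yogurt"],
})

# One row per user, one column per aisle, cells hold purchase counts
cross_small = pd.crosstab(mt_small["user_id"], mt_small["aisle"])

# The same table built by hand: group, count, then pivot aisles into columns
by_group = mt_small.groupby(["user_id", "aisle"]).size().unstack(fill_value=0)

print(cross_small)
```

User and aisle combinations that never occur are filled with 0, which is why the real table is mostly zeros.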

Step 3

import pandas as pd
from sklearn.decomposition import PCA

# Read the four tables
prior = pd.read_csv("order_products__prior.csv")
products = pd.read_csv("products.csv")
orders = pd.read_csv("orders.csv")
aisles = pd.read_csv("aisles.csv")

# Merge the four tables into one, joining on the shared key of each pair
_mg = pd.merge(prior, products, on="product_id")
_mg = pd.merge(_mg, orders, on="order_id")
mt = pd.merge(_mg, aisles, on="aisle_id")

# Cross-tabulation (a special grouping tool): count each user's purchases per aisle
cross = pd.crosstab(mt["user_id"], mt["aisle"])

# Run PCA, keeping enough components to explain 90% of the variance
pca = PCA(n_components=0.9)
data = pca.fit_transform(cross)

# Print the transformed data
print(data)

Output:
[[-2.42156587e+01  2.42942720e+00 -2.46636975e+00 ...  6.86800336e-01
   1.69439402e+00 -2.34323022e+00]
 [ 6.46320806e+00  3.67511165e+01  8.38255336e+00 ...  4.12121252e+00
   2.44689740e+00 -4.28348478e+00]
 [-7.99030162e+00  2.40438257e+00 -1.10300641e+01 ...  1.77534453e+00
  -4.44194030e-01  7.86665571e-01]
 ...
 [ 8.61143331e+00  7.70129866e+00  7.95240226e+00 ... -2.74252456e+00
   1.07112531e+00 -6.31925661e-02]
 [ 8.40862199e+01  2.04187340e+01  8.05410372e+00 ...  7.27554259e-01
   3.51339470e+00 -1.79079914e+01]
 [-1.39534562e+01  6.64621821e+00 -5.23030367e+00 ...  8.25329076e-01
   1.38230701e+00 -2.41942061e+00]]

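Checking data.shape shows that the number of columns has dropped from 134 aisles to 27 principal components. Because n_components=0.9 is a fraction, PCA keeps the smallest number of components whose cumulative explained variance reaches 90%.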
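
The effect of a fractional n_components can be reproduced without the Instacart files. The following is a minimal sketch on synthetic data (20 columns generated from 3 hidden factors plus a little noise, all made up for illustration); svd_solver="full" is set explicitly because the full SVD solver supports a fractional n_components.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 samples, 20 features driven by 3 hidden factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 20))

# A float in (0, 1) asks PCA for the fewest components reaching that share of variance
pca_demo = PCA(n_components=0.9, svd_solver="full")
reduced = pca_demo.fit_transform(X)

print(reduced.shape)                                  # (n_samples, components kept)
print(pca_demo.n_components_)                         # number of components kept
print(pca_demo.explained_variance_ratio_.cumsum())    # cumulative share of variance
```

On the real data, the same attributes (pca.n_components_ and pca.explained_variance_ratio_) show how much variance the retained components cover.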
