Pandas合并数据集:合并与连接

Pandas的一个基本特性就是提供了基于内存的高性能的内存式数据连接(join)与合并操作(merge).Pandas中的这类操作和数据库非常相似

Pandas的主接口就是pd.merge函数,下面将基于实例来讲解

关系代数

实际上pd.merge函数实现的功能的理论基础是关系代数(relational algbra).关系代数是处理关系型数据的通用理论.不仅仅是Pandas中的merge函数,绝大部分数据库的可用操作都是基于关系代数.

关系代数相关的方法论的强大之处在于它提出的若干个简单操作规则经过组合就可以实现对任意数据集进行复杂的操作.

借助于数据库或者程序中已经高效实现的这些基本操作规则,我们可以完成许多非常复杂的操作

类似于数据库,Pandas中的pd.merge函数与Series和DataFrame中的join方法里都实现了这些基本规则

下面就将基于数据连接的类型来讲解pd.merge函数和join方法

数据连接规则

Pandas的merge函数实现了三种的数据连接的类型:一对一,多对一和多对多

这三种数据连接的类型类似于Numpy的广播,都是由Pandas在后台自动完成的,即我们连接的两个DataFrame对象满足连接的规则时,就会自动进行连接

一对一连接

一对一连接是最普遍的数据连接方式,指在连接的两个数据表中以共有的一列作为连接的键来连接两个数据表,其中两个数据表的共有列中的每个值都是唯一的,

例如,Employee作为共有的列连接的第一个DataFrame对象的Employee这一列的值为[‘A’,'B’,’C’],在第二个DataFrame对象中Employee这一列的值为[‘A’,'B’,’C’]

注意,共有列是按照列的名称及Columns的值来匹配的,如果没有相同名称的列将会报错

按照某共有列中的元素进行一一对应的连接

DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'HireDate':['2020','2019','2021','2018']})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2))
>>>
  Employee        Group
0     Jack   Accounting
1      Sue  Engineering
2     Suzy  Engineering
3      Kim           HR

  Employee HireDate
0     Jack     2020
1      Sue     2019
2     Suzy     2021
3      Kim     2018

  Employee        Group HireDate
0     Jack   Accounting     2020
1      Sue  Engineering     2019
2     Suzy  Engineering     2021
3      Kim           HR     2018

在上面的例子中,Pandas后台进行的操作是首先在两个DataFrame对象的列中进行查找公共列,查找到两个DataFrame对象都具有Employee这一列之后,逐步遍历两个DataFrame对象中的这一列,将相同的键(即上面例子中的人名)后的数据连接起来,所以我们看到的结果就是两个数据表以上面的方式连接起来.

我们从上面的讲解中可以推断出,即便我们作为公共列中具体的每个键即顺序不一样,但是因为是对两个DataFrame对象的columns进行遍历查找,所以依旧是能够成功拼接的

DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Kim','Jack','Sue','Suzy'],'HireDate':['2020','2019','2021','2018']})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2))
>>>
  Employee        Group
0     Jack   Accounting
1      Sue  Engineering
2     Suzy  Engineering
3      Kim           HR

  Employee HireDate
0      Kim     2020
1     Jack     2019
2      Sue     2021
3     Suzy     2018

  Employee        Group HireDate
0     Jack   Accounting     2019
1      Sue  Engineering     2021
2     Suzy  Engineering     2018
3      Kim           HR     2020

注意,为了便于讲解,我们这里让两个数据表中公共列具有相同的元素,如果具有不同的元素(从集合的角度来说由两个列中的元素分别构成的集合不等价)会触发另外的规则.为了避免造成混淆,这里就先单一化处理

此外merge函数默认会丢弃掉行索引.

DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']},index=list('ABCD'))
DataFrame_2=pd.DataFrame({'Employee':['Kim','Jack','Sue','Suzy'],'HireDate':['2020','2019','2021','2018']})
DataFrame_3=pd.DataFrame({'Names':['Jack','Sue','Suzy','Kim'],'Times':['2020','2019','2021','2018']})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2))
>>>
  Employee        Group
A     Jack   Accounting
B      Sue  Engineering
C     Suzy  Engineering
D      Kim           HR

  Employee HireDate
0      Kim     2020
1     Jack     2019
2      Sue     2021
3     Suzy     2018

  Employee        Group HireDate
0     Jack   Accounting     2019
1      Sue  Engineering     2021
2     Suzy  Engineering     2018
3      Kim           HR     2020

多对一连接

多对一连接是指需要连接的两个DataFrame对象中作为连接的公共列,在一个DataFrame对象中具有重复(即多),在另一个DataFrame对象中只出现了一次(即一)

例如将两个DataFrame对象中名称都为Employee的列作为公共列进行连接,第一个DataFrame对象中Employee的值为[‘A’,‘B’,'C’],在第二个DataFrame对象中Employee的值为[‘A’,‘A’,‘B’,‘C’]

此时进行的操作就是会重复连接,即将以一多次连接到多

DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim','Jack'],'Group':['Accounting','Engineering','Engineering','HR','CEO']},index=list('ABCDE'))
DataFrame_2=pd.DataFrame({'Employee':['Kim','Jack','Sue','Suzy',],'HireDate':['2020','2019','2021','2018']})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2))
>>>
  Employee        Group
A     Jack   Accounting
B      Sue  Engineering
C     Suzy  Engineering
D      Kim           HR
E     Jack          CEO

  Employee HireDate
0      Kim     2020
1     Jack     2019
2      Sue     2021
3     Suzy     2018

  Employee        Group HireDate
0     Jack   Accounting     2019
1     Jack          CEO     2019
2      Sue  Engineering     2021
3     Suzy  Engineering     2018
4      Kim           HR     2020

上面的例子中以Employee作为共有列进行连接,在DataFrame_1中Jack出现了两次,所以会将DataFrame_2中唯一的Jack后面的值重复连接到DataFrame_1中两个’Jack’后面去

多对多连接

多对多连接是指作为共有的列在两个DataFrame对象中都具有重复的值,对应的操作就是将两者同时扩充后连接

例如,以Employee为共有列,第一个DataFrame中Employee列的值为[‘A’,‘A’,‘B’,‘C’],第二个DataFrame对象中Employee列的值为[‘A’,‘B’,‘C’,‘C’]

DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim','Jack'],'Group':['Accounting','Engineering','Engineering','HR','CEO']},index=list('ABCDE'))
DataFrame_2=pd.DataFrame({'Employee':['Kim','Jack','Sue','Suzy','Kim'],'HireDate':['2020','2019','2021','2018','2020']})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2))
>>>
  Employee        Group
A     Jack   Accounting
B      Sue  Engineering
C     Suzy  Engineering
D      Kim           HR
E     Jack          CEO
  Employee HireDate
0      Kim     2020
1     Jack     2019
2      Sue     2021
3     Suzy     2018
4      Kim     2020
  Employee        Group HireDate
0     Jack   Accounting     2019
1     Jack          CEO     2019
2      Sue  Engineering     2021
3     Suzy  Engineering     2018
4      Kim           HR     2020
5      Kim           HR     2020

在上面的例子中,连接的共有列是Employee,DataFrame_1中Employee的值为[‘Jack’,‘Sue’,‘Suzy’,‘Kim’,‘Jack’],其中Jack重复了两次,DataFrame_2中Employee的值为[‘Kim’,‘Jack’,‘Sue’,‘Suzy’,‘Kim’],其中Kim重复了两次

对应的操作就是将DataFrame_2中唯一的Jack重复复制了两次,分别连接到DataFrame_1中的两个Jack后面去,同理,DataFrame_1中唯一的Kim重复复制了两次,分别连接到DataFrame_2中的两个Kim后面去

假如两个DataFrame对象中重复的都是Jack,那么会出现下面的结果

DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim','Jack'],'Group':['Accounting','Engineering','Engineering','HR','CEO']},index=list('ABCDE'))
DataFrame_2=pd.DataFrame({'Employee':['Kim','Jack','Sue','Suzy','Jack'],'HireDate':['2020','2019','2021','2018','2020']})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2))
>>>
  Employee        Group
A     Jack   Accounting
B      Sue  Engineering
C     Suzy  Engineering
D      Kim           HR
E     Jack          CEO
  Employee HireDate
0      Kim     2020
1     Jack     2019
2      Sue     2021
3     Suzy     2018
4     Jack     2020
  Employee        Group HireDate
0     Jack   Accounting     2019
1     Jack   Accounting     2020
2     Jack          CEO     2019
3     Jack          CEO     2020
4      Sue  Engineering     2021
5     Suzy  Engineering     2018
6      Kim           HR     2020

按照同样的思路,DataFrame_1中的作为会计的Jack会复制两次连接到DataFrame_2中的两个Jack中去,于是就有了

0     Jack   Accounting     2019
1     Jack   Accounting     2020

然后作为CEO的Jack也会复制两次到DataFrame_2中的两个Jack去,就有了

2     Jack          CEO     2019
3     Jack          CEO     2020

接下来将DataFrame_2中的第一个Jack再复制到第一个DataFarme_1中的Jack去,有了

0     Jack   Accounting     2019
2     Jack          CEO     2019

同理,将第二个Jack再复制连接到第一个DataFrame中的Jack,就有了

1     Jack   Accounting     2020
3     Jack          CEO     2020

所以结合以上的步骤,最终的得到如上的表

设置数据合并的键

前面介绍的是pd.merge函数的默认规则,但是为了能够更好使用pd.merge函数来合并数据集,我们实际上可以指定pd.merge函数的一些参数来完成我们指定的操作

我们将讲解pd.merge的一些参数

on参数

on参数用于设置作为合并的列名,可以为其传递一个列名字符串或者列名列表

在当on参数缺失的时候,默认会以第一个匹配的列名作为合并的键

并且如果我们指定的列名并不是两个数据表共有的,就会报错

DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Sue','Kim','Jack','Suzy'],'HireDate':[2020,2019,2021,2018]})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2,on='Employee'))
>>>
  Employee        Group
0     Jack   Accounting
1      Sue  Engineering
2     Suzy  Engineering
3      Kim           HR
  Employee  HireDate
0      Sue      2020
1      Kim      2019
2     Jack      2021
3     Suzy      2018
  Employee        Group  HireDate
0     Jack   Accounting      2021
1      Sue  Engineering      2020
2     Suzy  Engineering      2018
3      Kim           HR      2019

注意,传递给on参数的值必须是两个对象中共有的列名,否则会报错

DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Sue','Kim','Jack','Suzy'],'HireDate':[2020,2019,2021,2018]})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2,on='Group'))
>>>
 Employee        Group
0     Jack   Accounting
1      Sue  Engineering
2     Suzy  Engineering
3      Kim           HR
  Employee  HireDate
0      Sue      2020
1      Kim      2019
2     Jack      2021
3     Suzy      2018
Traceback (most recent call last):
  File "TryPandas.py", line 535, in <module>
    print(pd.merge(DataFrame_1,DataFrame_2,on='Group'))
  File "/home/jackwang/.pyenv/versions/3.6.8/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 86, in merge
    validate=validate,
  File "/home/jackwang/.pyenv/versions/3.6.8/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 627, in __init__
    ) = self._get_merge_keys()
  File "/home/jackwang/.pyenv/versions/3.6.8/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 983, in _get_merge_keys
    right_keys.append(right._get_label_or_level_values(rk))
  File "/home/jackwang/.pyenv/versions/3.6.8/lib/python3.6/site-packages/pandas/core/generic.py", line 1692, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'Group'

left_on与right_on参数

默认情况下,merge函数将会匹配两个数据表中同名的列,然后以这个列中的值为键来连接两个表的数据

但是我们可以指定left_on与right_on参数来指定左右两个表中用来作为键的列名

例如数据表A拼接在数据表B的左边,数据表B拼接在数据表A的右边,则left_on指定A表的列名,right_on指定B表的列名

通常指定left_on与right_on参数用于两个数据表中没有同名但是有同值的列,这个时候我们先将两个数据表按照值的对应(包括一对一,多对一和多对多)拼接在一起,然后使用drop方法删去不需要的列

DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Names':['Sue','Kim','Jack','Suzy'],'HireDate':[2020,2019,2021,2018]})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_2,DataFrame_1,left_on='Names',right_on='Employee'))
print(pd.merge(DataFrame_1,DataFrame_2,left_on='Employee',right_on='Names'))
print(pd.merge(DataFrame_1,DataFrame_2,left_on='Employee',right_on='Names').drop('Names',axis=1))
>>>
  Employee        Group
0     Jack   Accounting
1      Sue  Engineering
2     Suzy  Engineering
3      Kim           HR

  Names  HireDate
0   Sue      2020
1   Kim      2019
2  Jack      2021
3  Suzy      2018

  Names  HireDate Employee        Group
0   Sue      2020      Sue  Engineering
1   Kim      2019      Kim           HR
2  Jack      2021     Jack   Accounting
3  Suzy      2018     Suzy  Engineering

  Employee        Group Names  HireDate
0     Jack   Accounting  Jack      2021
1      Sue  Engineering   Sue      2020
2     Suzy  Engineering  Suzy      2018
3      Kim           HR   Kim      2019

  Employee        Group  HireDate
0     Jack   Accounting      2021
1      Sue  Engineering      2020
2     Suzy  Engineering      2018
3      Kim           HR      2019

Left_index和Right_index参数

前面讲过,使用pd.merge函数时候得到的拼接后的列默认会丢弃掉原来两个数据表的行索引(index),我们可以指定left_index和right_index来保留原有的索引.

我们前面在层级索引中讲过,对于DataFrame对象,我们可以使用reset_index和set_index方法来重置或者设置行索引名,结合这个方法,我们常用的操作是读取文件,然后将其中的一列作为index

这个时候我们还想要合并两个数据集的话,直接使用merge函数会造成索引的丢弃,例如

DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Names':['Sue','Kim','Jack','Suzy'],'HireDate':['2020','2019','2021','2018']})
print(DataFrame_1.set_index('Group'))
print(DataFrame_2.set_index('HireDate'))
print(pd.merge(DataFrame_1.set_index('Group'),DataFrame_2.set_index('HireDate'),left_on='Employee',right_on='Names'))
>>>
            Employee
Group               
Accounting      Jack
Engineering      Sue
Engineering     Suzy
HR               Kim
         Names
HireDate      
2020       Sue
2019       Kim
2021      Jack
2018      Suzy
  Employee Names
0     Jack  Jack
1      Sue   Sue
2     Suzy  Suzy
3      Kim   Kim

首先需要注意的,上面的例子中把两个数据表的Group和HireDate作为行索引(index),这是因为如果把Employee和Names作为行索引(index),那么我们拼接的时候没有具有共同的值的列作为键,这样会引发后面会讲到的DataFrame合并时的另外的规则,结果就是返回一个空DataFrame对象

为此,我们可以使用left_index和right_index来指定保留行索引

DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Sue','Kim','Jack','Suzy'],'HireDate':['2020','2019','2021','2018']})
print(DataFrame_1.set_index('Employee'))
print(DataFrame_2.set_index('Employee'))
print(pd.merge(DataFrame_1.set_index('Employee'),DataFrame_2.set_index('Employee'),\
    left_index=True,right_index=True))
>>>
                Group
Employee             
Jack       Accounting
Sue       Engineering
Suzy      Engineering
Kim                HR

         HireDate
Employee         
Sue          2020
Kim          2019
Jack         2021
Suzy         2018

                Group HireDate
Employee                      
Jack       Accounting     2021
Sue       Engineering     2020
Suzy      Engineering     2018
Kim HR     2019

注意,我们设置完行索引(index)后,如果两个数据表的行索引的值不同,即便有相同的值的列,依旧会得到一个空表,这依旧是因为触发了后面会讲到的数据连接时的集合操作规则

DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Sue','Kim','Jack','Suzy'],'HireDate':['2020','2019','2021','2018']})
print(DataFrame_1.set_index('Group'))
print(DataFrame_2.set_index('HireDate'))
print(pd.merge(DataFrame_1.set_index('Group'),DataFrame_2.set_index('HireDate'),\
    left_index=True,right_index=True))
>>>
            Employee
Group               
Accounting      Jack
Engineering      Sue
Engineering     Suzy
HR               Kim
         Employee
HireDate         
2020          Sue
2019          Kim
2021         Jack
2018         Suzy
Empty DataFrame
Columns: [Employee_x, Employee_y]
Index: []

join方法

实际上我们想要实现合并索引的效果,除了上面讲的方法外,我们也可以直接调用join方法

DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Sue','Kim','Jack','Suzy'],'HireDate':['2020','2019','2021','2018']})
print(DataFrame_1.set_index('Employee'))
print(DataFrame_2.set_index('Employee'))
print(DataFrame_1.set_index('Employee').join(DataFrame_2.set_index('Employee')))
>>>
                Group
Employee             
Jack       Accounting
Sue       Engineering
Suzy      Engineering

Kim                HR
         HireDate
Employee         
Sue          2020
Kim          2019
Jack         2021
Suzy         2018

                Group HireDate
Employee                      
Jack       Accounting     2021
Sue       Engineering     2020
Suzy      Engineering     2018
Kim                HR     2019

设置数据连接的集合操作规则

前面只是讲解了数据连接的类型以及pd.merge函数中可用的参数,但是真正想要高效的进行数据连接,我们需要明白集合操作规则

我们在前面的几个例子额外提到的效果其实都是因为数据连接时的集合操作规则

我们将连接的数据表(这里是两个DataFrame对象)分为左右两张表

内连接是指只连接两张表中匹配的内容

外连接是指连接两张表中全部的内容

左连接是指在右表中匹配并连接与左表匹配的内容

右连接是指在左表中匹配并连接与右表匹配的内容

实现具体的连接方式就是指定pd.merge函数的how参数

内连接

内连接是指连接两个对象返回两者的交集的连接方式,我们可以指定pd.merge函数的how参数为inner设定内连接,how参数缺失时默认为inner

DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Sue','Kim','May','John'],'HireDate':['2020','2019','2021','2018']})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2,how='inner'))
>>>
  Employee        Group
0     Jack   Accounting
1      Sue  Engineering
2     Suzy  Engineering
3      Kim           HR
  Employee HireDate
0      Sue     2020
1      Kim     2019
2      May     2021
3     John     2018
  Employee        Group HireDate
0      Sue  Engineering     2020
1      Kim           HR     2019

上面的例子中,我们指定了用于合并的列是Employee这一列,那么在指定这一列中将会搜寻共有的元素,即Sue和Kim

注意,判断某一行是否两个表的公共列时候,如果我们指定了行索引的话,也会将行索引纳入判断的范围,例如

DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Sue','Kim','Jack','Suzy'],'HireDate':['2020','2019','2021','2018']})
print(DataFrame_1.set_index('Group'))
print(DataFrame_2.set_index('HireDate'))
print(pd.merge(DataFrame_1.set_index('Group'),DataFrame_2.set_index('HireDate'),\
    left_index=True,right_index=True,left_on='Employee',right_on='Employee',how='inner'))
>>>
            Employee
Group               
Accounting      Jack
Engineering      Sue
Engineering     Suzy
HR               Kim
         Employee
HireDate         
2020          Sue
2019          Kim
2021         Jack
2018         Suzy
Empty DataFrame
Columns: [Employee, Employee_x, Employee_y]
Index: []

所以为了避免出现这样的麻烦,我们进行数据分析前的数据预处理过程就是:

  1. 加载数据
  2. 合并数据
  3. 删去不需要的列
  4. 填充缺失值
  5. 设置行列索引名

外连接

与内连接相对,外连接只是返回的结果包含所有输入列,即返回输入的并集,缺失值用NaN代替

外连接需要我们指定how参数为outer

DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Sue','Kim','May','John'],'HireDate':['2020','2019','2021','2018']})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2,how='outer'))
>>>
 Employee        Group
0     Jack   Accounting
1      Sue  Engineering
2     Suzy  Engineering
3      Kim           HR

  Employee HireDate
0      Sue     2020
1      Kim     2019
2      May     2021
3     John     2018

  Employee        Group HireDate
0     Jack   Accounting      NaN
1      Sue  Engineering     2020
2     Suzy  Engineering      NaN
3      Kim           HR     2019
4      May          NaN     2021
5     John          NaN     2018

左右连接

除了内外连接之外,我们还可以指定how参数为left / right,分别对应左右连接

DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Sue','Kim','May','Suzy','John','Jack'],'HireDate':['2020','2019','2021','2018','2017','2016']})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2,how='left'))
print(pd.merge(DataFrame_1,DataFrame_2,how='right'))
>>>
  Employee        Group
0     Jack   Accounting
1      Sue  Engineering
2     Suzy  Engineering
3      Kim           HR
  Employee HireDate
0      Sue     2020
1      Kim     2019
2      May     2021
3     Suzy     2018
4     John     2017
5     Jack     2016
  Employee        Group HireDate
0     Jack   Accounting     2016
1      Sue  Engineering     2020
2     Suzy  Engineering     2018
3      Kim           HR     2019
  Employee        Group HireDate
0     Jack   Accounting     2016
1      Sue  Engineering     2020
2     Suzy  Engineering     2018
3      Kim           HR     2019
4      May          NaN     2021
5     John          NaN     2017

重复列名

有的时候我们需要连接的两个DataFrame对象具有重复的列名,pd.merge函数默认会为两个重复的列名自动添加后缀,但是我们可以使用suffixes参数来指定后缀

DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Sue','Kim','May','Suzy','John','Jack'],'Group':['2020','2019','2021','2018','2017','2016']})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2,on='Employee'))
>>>
  Employee        Group
0     Jack   Accounting
1      Sue  Engineering
2     Suzy  Engineering
3      Kim           HR
  Employee Group
0      Sue  2020
1      Kim  2019
2      May  2021
3     Suzy  2018
4     John  2017
5     Jack  2016
  Employee      Group_x Group_y
0     Jack   Accounting    2016
1      Sue  Engineering    2020
2     Suzy  Engineering    2018
3      Kim           HR    2019

上面的例子中我们指定以Employee为键来合并两个数组,但是有两个名字同为Group的列,这时候Pandas默认会为两者添加后缀_x和_y

我们指定suffixes参数来指定后缀

DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Sue','Kim','May','Suzy','John','Jack'],'Group':['2020','2019','2021','2018','2017','2016']})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2,on='Employee',suffixes=['_Left','_Right']))
>>>
Employee        Group
0     Jack   Accounting
1      Sue  Engineering
2     Suzy  Engineering
3      Kim           HR
  Employee Group
0      Sue  2020
1      Kim  2019
2      May  2021
3     Suzy  2018
4     John  2017
5     Jack  2016
  Employee   Group_Left Group_Right
0     Jack   Accounting        2016
1      Sue  Engineering        2020
2     Suzy  Engineering        2018
3      Kim           HR        2019
Logo

魔乐社区(Modelers.cn) 是一个中立、公益的人工智能社区,提供人工智能工具、模型、数据的托管、展示与应用协同服务,为人工智能开发及爱好者搭建开放的学习交流平台。社区通过理事会方式运作,由全产业链共同建设、共同运营、共同享有,推动国产AI生态繁荣发展。

更多推荐