8.Pandas合并数据集:合并与连接
文章目录Pandas合并数据集:合并与连接关系代数数据连接规则一对一连接多对一连接多对多连接设置数据合并的键on参数left_on与right_on参数Left_index和Right_index参数join方法设置数据连接的集合操作规则内连接外连接左右连接重复列名Pandas合并数据集:合并与连接Pandas的一个基本特性就是提供了基于内存的高性能的内存式数据连接(join)与合并操作(merg
文章目录
Pandas合并数据集:合并与连接
Pandas的一个基本特性就是提供了基于内存的高性能的内存式数据连接(join)与合并操作(merge).Pandas中的这类操作和数据库非常相似
Pandas的主接口就是pd.merge函数,下面将基于实例来讲解
关系代数
实际上pd.merge函数实现的功能的理论基础是关系代数(relational algbra).关系代数是处理关系型数据的通用理论.不仅仅是Pandas中的merge函数,绝大部分数据库的可用操作都是基于关系代数.
关系代数相关的方法论的强大之处在于它提出的若干个简单操作规则经过组合就可以实现对任意数据集进行复杂的操作.
借助于数据库或者程序中已经高效实现的这些基本操作规则,我们可以完成许多非常复杂的操作
类似于数据库,Pandas中的pd.merge函数与Series和DataFrame中的join方法里都实现了这些基本规则
下面就将基于数据连接的类型来讲解pd.merge函数和join方法
数据连接规则
Pandas的merge函数实现了三种的数据连接的类型:一对一,多对一和多对多
这三种数据连接的类型类似于Numpy的广播,都是由Pandas在后台自动完成的,即我们连接的两个DataFrame对象满足连接的规则时,就会自动进行连接
一对一连接
一对一连接是最普遍的数据连接方式,指在连接的两个数据表中以共有的一列作为连接的键来连接两个数据表,其中两个数据表的共有列中的每个值都是唯一的,
例如,Employee作为共有的列连接的第一个DataFrame对象的Employee这一列的值为[‘A’,'B’,’C’],在第二个DataFrame对象中Employee这一列的值为[‘A’,'B’,’C’]
注意,共有列是按照列的名称及Columns的值来匹配的,如果没有相同名称的列将会报错
按照某共有列中的元素进行一一对应的连接
DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'HireDate':['2020','2019','2021','2018']})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2))
>>>
Employee Group
0 Jack Accounting
1 Sue Engineering
2 Suzy Engineering
3 Kim HR
Employee HireDate
0 Jack 2020
1 Sue 2019
2 Suzy 2021
3 Kim 2018
Employee Group HireDate
0 Jack Accounting 2020
1 Sue Engineering 2019
2 Suzy Engineering 2021
3 Kim HR 2018
在上面的例子中,Pandas后台进行的操作是首先在两个DataFrame对象的列中进行查找公共列,查找到两个DataFrame对象都具有Employee这一列之后,逐步遍历两个DataFrame对象中的这一列,将相同的键(即上面例子中的人名)后的数据连接起来,所以我们看到的结果就是两个数据表以上面的方式连接起来.
我们从上面的讲解中可以推断出,即便我们作为公共列中具体的每个键即顺序不一样,但是因为是对两个DataFrame对象的columns进行遍历查找,所以依旧是能够成功拼接的
DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Kim','Jack','Sue','Suzy'],'HireDate':['2020','2019','2021','2018']})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2))
>>>
Employee Group
0 Jack Accounting
1 Sue Engineering
2 Suzy Engineering
3 Kim HR
Employee HireDate
0 Kim 2020
1 Jack 2019
2 Sue 2021
3 Suzy 2018
Employee Group HireDate
0 Jack Accounting 2019
1 Sue Engineering 2021
2 Suzy Engineering 2018
3 Kim HR 2020
注意,为了便于讲解,我们这里让两个数据表中公共列具有相同的元素,如果具有不同的元素(从集合的角度来说由两个列中的元素分别构成的集合不等价)会触发另外的规则.为了避免造成混淆,这里就先单一化处理
此外merge函数默认会丢弃掉行索引.
DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']},index=list('ABCD'))
DataFrame_2=pd.DataFrame({'Employee':['Kim','Jack','Sue','Suzy'],'HireDate':['2020','2019','2021','2018']})
DataFrame_3=pd.DataFrame({'Names':['Jack','Sue','Suzy','Kim'],'Times':['2020','2019','2021','2018']})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2))
>>>
Employee Group
A Jack Accounting
B Sue Engineering
C Suzy Engineering
D Kim HR
Employee HireDate
0 Kim 2020
1 Jack 2019
2 Sue 2021
3 Suzy 2018
Employee Group HireDate
0 Jack Accounting 2019
1 Sue Engineering 2021
2 Suzy Engineering 2018
3 Kim HR 2020
多对一连接
多对一连接是指需要连接的两个DataFrame对象中作为连接的公共列,在一个DataFrame对象中具有重复(即多),在另一个DataFrame对象中只出现了一次(即一)
例如将两个DataFrame对象中名称都为Employee的列作为公共列进行连接,第一个DataFrame对象中Employee的值为[‘A’,‘B’,'C’],在第二个DataFrame对象中Employee的值为[‘A’,‘A’,‘B’,‘C’]
此时进行的操作就是会重复连接,即将以一多次连接到多
DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim','Jack'],'Group':['Accounting','Engineering','Engineering','HR','CEO']},index=list('ABCDE'))
DataFrame_2=pd.DataFrame({'Employee':['Kim','Jack','Sue','Suzy',],'HireDate':['2020','2019','2021','2018']})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2))
>>>
Employee Group
A Jack Accounting
B Sue Engineering
C Suzy Engineering
D Kim HR
E Jack CEO
Employee HireDate
0 Kim 2020
1 Jack 2019
2 Sue 2021
3 Suzy 2018
Employee Group HireDate
0 Jack Accounting 2019
1 Jack CEO 2019
2 Sue Engineering 2021
3 Suzy Engineering 2018
4 Kim HR 2020
上面的例子中以Employee作为共有列进行连接,在DataFrame_1中Jack出现了两次,所以会将DataFrame_2中唯一的Jack后面的值重复连接到DataFrame_1中两个’Jack’后面去
多对多连接
多对多连接是指作为共有的列在两个DataFrame对象中都具有重复的值,对应的操作就是将两者同时扩充后连接
例如,以Employee为共有列,第一个DataFrame中Employee列的值为[‘A’,‘A’,‘B’,‘C’],第二个DataFrame对象中Employee列的值为[‘A’,‘B’,‘C’,‘C’]
DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim','Jack'],'Group':['Accounting','Engineering','Engineering','HR','CEO']},index=list('ABCDE'))
DataFrame_2=pd.DataFrame({'Employee':['Kim','Jack','Sue','Suzy','Kim'],'HireDate':['2020','2019','2021','2018','2020']})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2))
>>>
Employee Group
A Jack Accounting
B Sue Engineering
C Suzy Engineering
D Kim HR
E Jack CEO
Employee HireDate
0 Kim 2020
1 Jack 2019
2 Sue 2021
3 Suzy 2018
4 Kim 2020
Employee Group HireDate
0 Jack Accounting 2019
1 Jack CEO 2019
2 Sue Engineering 2021
3 Suzy Engineering 2018
4 Kim HR 2020
5 Kim HR 2020
在上面的例子中,连接的共有列是Employee,DataFrame_1中Employee的值为[‘Jack’,‘Sue’,‘Suzy’,‘Kim’,‘Jack’],其中Jack重复了两次,DataFrame_2中Employee的值为[‘Kim’,‘Jack’,‘Sue’,‘Suzy’,‘Kim’],其中Kim重复了两次
对应的操作就是将DataFrame_2中唯一的Jack重复复制了两次,分别连接到DataFrame_1中的两个Jack后面去,同理,DataFrame_1中唯一的Kim重复复制了两次,分别连接到DataFrame_2中的两个Kim后面去
假如两个DataFrame对象中重复的都是Jack,那么会出现下面的结果
DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim','Jack'],'Group':['Accounting','Engineering','Engineering','HR','CEO']},index=list('ABCDE'))
DataFrame_2=pd.DataFrame({'Employee':['Kim','Jack','Sue','Suzy','Jack'],'HireDate':['2020','2019','2021','2018','2020']})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2))
>>>
Employee Group
A Jack Accounting
B Sue Engineering
C Suzy Engineering
D Kim HR
E Jack CEO
Employee HireDate
0 Kim 2020
1 Jack 2019
2 Sue 2021
3 Suzy 2018
4 Jack 2020
Employee Group HireDate
0 Jack Accounting 2019
1 Jack Accounting 2020
2 Jack CEO 2019
3 Jack CEO 2020
4 Sue Engineering 2021
5 Suzy Engineering 2018
6 Kim HR 2020
按照同样的思路,DataFrame_1中的作为会计的Jack会复制两次连接到DataFrame_2中的两个Jack中去,于是就有了
0 Jack Accounting 2019
1 Jack Accounting 2020
然后作为CEO的Jack也会复制两次到DataFrame_2中的两个Jack去,就有了
2 Jack CEO 2019
3 Jack CEO 2020
接下来将DataFrame_2中的第一个Jack再复制到第一个DataFarme_1中的Jack去,有了
0 Jack Accounting 2019
2 Jack CEO 2019
同理,将第二个Jack再复制连接到第一个DataFrame中的Jack,就有了
1 Jack Accounting 2020
3 Jack CEO 2020
所以结合以上的步骤,最终的得到如上的表
设置数据合并的键
前面介绍的是pd.merge函数的默认规则,但是为了能够更好使用pd.merge函数来合并数据集,我们实际上可以指定pd.merge函数的一些参数来完成我们指定的操作
我们将讲解pd.merge的一些参数
on参数
on参数用于设置作为合并的列名,可以为其传递一个列名字符串或者列名列表
在当on参数缺失的时候,默认会以第一个匹配的列名作为合并的键
并且如果我们指定的列名并不是两个数据表共有的,就会报错
DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Sue','Kim','Jack','Suzy'],'HireDate':[2020,2019,2021,2018]})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2,on='Employee'))
>>>
Employee Group
0 Jack Accounting
1 Sue Engineering
2 Suzy Engineering
3 Kim HR
Employee HireDate
0 Sue 2020
1 Kim 2019
2 Jack 2021
3 Suzy 2018
Employee Group HireDate
0 Jack Accounting 2021
1 Sue Engineering 2020
2 Suzy Engineering 2018
3 Kim HR 2019
注意,传递给on参数的值必须是两个对象中共有的列名,否则会报错
DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Sue','Kim','Jack','Suzy'],'HireDate':[2020,2019,2021,2018]})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2,on='Group'))
>>>
Employee Group
0 Jack Accounting
1 Sue Engineering
2 Suzy Engineering
3 Kim HR
Employee HireDate
0 Sue 2020
1 Kim 2019
2 Jack 2021
3 Suzy 2018
Traceback (most recent call last):
File "TryPandas.py", line 535, in <module>
print(pd.merge(DataFrame_1,DataFrame_2,on='Group'))
File "/home/jackwang/.pyenv/versions/3.6.8/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 86, in merge
validate=validate,
File "/home/jackwang/.pyenv/versions/3.6.8/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 627, in __init__
) = self._get_merge_keys()
File "/home/jackwang/.pyenv/versions/3.6.8/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 983, in _get_merge_keys
right_keys.append(right._get_label_or_level_values(rk))
File "/home/jackwang/.pyenv/versions/3.6.8/lib/python3.6/site-packages/pandas/core/generic.py", line 1692, in _get_label_or_level_values
raise KeyError(key)
KeyError: 'Group'
left_on与right_on参数
默认情况下,merge函数将会匹配两个数据表中同名的列,然后以这个列中的值为键来连接两个表的数据
但是我们可以指定left_on与right_on参数来指定左右两个表中用来作为键的列名
例如数据表A拼接在数据表B的左边,数据表B拼接在数据表A的右边,则left_on指定A表的列名,right_on指定B表的列名
通常指定left_on与right_on参数用于两个数据表中没有同名但是有同值的列,这个时候我们先将两个数据表按照值的对应(包括一对一,多对一和多对多)拼接在一起,然后使用drop方法删去不需要的列
DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Names':['Sue','Kim','Jack','Suzy'],'HireDate':[2020,2019,2021,2018]})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_2,DataFrame_1,left_on='Names',right_on='Employee'))
print(pd.merge(DataFrame_1,DataFrame_2,left_on='Employee',right_on='Names'))
print(pd.merge(DataFrame_1,DataFrame_2,left_on='Employee',right_on='Names').drop('Names',axis=1))
>>>
Employee Group
0 Jack Accounting
1 Sue Engineering
2 Suzy Engineering
3 Kim HR
Names HireDate
0 Sue 2020
1 Kim 2019
2 Jack 2021
3 Suzy 2018
Names HireDate Employee Group
0 Sue 2020 Sue Engineering
1 Kim 2019 Kim HR
2 Jack 2021 Jack Accounting
3 Suzy 2018 Suzy Engineering
Employee Group Names HireDate
0 Jack Accounting Jack 2021
1 Sue Engineering Sue 2020
2 Suzy Engineering Suzy 2018
3 Kim HR Kim 2019
Employee Group HireDate
0 Jack Accounting 2021
1 Sue Engineering 2020
2 Suzy Engineering 2018
3 Kim HR 2019
Left_index和Right_index参数
前面讲过,使用pd.merge函数时候得到的拼接后的列默认会丢弃掉原来两个数据表的行索引(index),我们可以指定left_index和right_index来保留原有的索引.
我们前面在层级索引中讲过,对于DataFrame对象,我们可以使用reset_index和set_index方法来重置或者设置行索引名,结合这个方法,我们常用的操作是读取文件,然后将其中的一列作为index
这个时候我们还想要合并两个数据集的话,直接使用merge函数会造成索引的丢弃,例如
DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Names':['Sue','Kim','Jack','Suzy'],'HireDate':['2020','2019','2021','2018']})
print(DataFrame_1.set_index('Group'))
print(DataFrame_2.set_index('HireDate'))
print(pd.merge(DataFrame_1.set_index('Group'),DataFrame_2.set_index('HireDate'),left_on='Employee',right_on='Names'))
>>>
Employee
Group
Accounting Jack
Engineering Sue
Engineering Suzy
HR Kim
Names
HireDate
2020 Sue
2019 Kim
2021 Jack
2018 Suzy
Employee Names
0 Jack Jack
1 Sue Sue
2 Suzy Suzy
3 Kim Kim
首先需要注意的,上面的例子中把两个数据表的Group和HireDate作为行索引(index),这是因为如果把Employee和Names作为行索引(index),那么我们拼接的时候没有具有共同的值的列作为键,这样会引发后面会讲到的DataFrame合并时的另外的规则,结果就是返回一个空DataFrame对象
为此,我们可以使用left_index和right_index来指定保留行索引
DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Sue','Kim','Jack','Suzy'],'HireDate':['2020','2019','2021','2018']})
print(DataFrame_1.set_index('Employee'))
print(DataFrame_2.set_index('Employee'))
print(pd.merge(DataFrame_1.set_index('Employee'),DataFrame_2.set_index('Employee'),\
left_index=True,right_index=True))
>>>
Group
Employee
Jack Accounting
Sue Engineering
Suzy Engineering
Kim HR
HireDate
Employee
Sue 2020
Kim 2019
Jack 2021
Suzy 2018
Group HireDate
Employee
Jack Accounting 2021
Sue Engineering 2020
Suzy Engineering 2018
Kim HR 2019
注意,我们设置完行索引(index)后,如果两个数据表的行索引的值不同,即便有相同的值的列,依旧会得到一个空表,这依旧是因为触发了后面会讲到的数据连接时的集合操作规则
DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Sue','Kim','Jack','Suzy'],'HireDate':['2020','2019','2021','2018']})
print(DataFrame_1.set_index('Group'))
print(DataFrame_2.set_index('HireDate'))
print(pd.merge(DataFrame_1.set_index('Group'),DataFrame_2.set_index('HireDate'),\
left_index=True,right_index=True))
>>>
Employee
Group
Accounting Jack
Engineering Sue
Engineering Suzy
HR Kim
Employee
HireDate
2020 Sue
2019 Kim
2021 Jack
2018 Suzy
Empty DataFrame
Columns: [Employee_x, Employee_y]
Index: []
join方法
实际上我们想要实现合并索引的效果,除了上面讲的方法外,我们也可以直接调用join方法
DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Sue','Kim','Jack','Suzy'],'HireDate':['2020','2019','2021','2018']})
print(DataFrame_1.set_index('Employee'))
print(DataFrame_2.set_index('Employee'))
print(DataFrame_1.set_index('Employee').join(DataFrame_2.set_index('Employee')))
>>>
Group
Employee
Jack Accounting
Sue Engineering
Suzy Engineering
Kim HR
HireDate
Employee
Sue 2020
Kim 2019
Jack 2021
Suzy 2018
Group HireDate
Employee
Jack Accounting 2021
Sue Engineering 2020
Suzy Engineering 2018
Kim HR 2019
设置数据连接的集合操作规则
前面只是讲解了数据连接的类型以及pd.merge函数中可用的参数,但是真正想要高效的进行数据连接,我们需要明白集合操作规则
我们在前面的几个例子额外提到的效果其实都是因为数据连接时的集合操作规则
我们将连接的数据表(这里是两个DataFrame对象)分为左右两张表
内连接是指只连接两张表中匹配的内容
外连接是指连接两张表中全部的内容
左连接是指在右表中匹配并连接与左表匹配的内容
右连接是指在左表中匹配并连接与右表匹配的内容
实现具体的连接方式就是指定pd.merge函数的how参数
内连接
内连接是指连接两个对象返回两者的交集的连接方式,我们可以指定pd.merge函数的how参数为inner设定内连接,how参数缺失时默认为inner
DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Sue','Kim','May','John'],'HireDate':['2020','2019','2021','2018']})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2,how='inner'))
>>>
Employee Group
0 Jack Accounting
1 Sue Engineering
2 Suzy Engineering
3 Kim HR
Employee HireDate
0 Sue 2020
1 Kim 2019
2 May 2021
3 John 2018
Employee Group HireDate
0 Sue Engineering 2020
1 Kim HR 2019
上面的例子中,我们指定了用于合并的列是Employee这一列,那么在指定这一列中将会搜寻共有的元素,即Sue和Kim
注意,判断某一行是否两个表的公共列时候,如果我们指定了行索引的话,也会将行索引纳入判断的范围,例如
DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Sue','Kim','Jack','Suzy'],'HireDate':['2020','2019','2021','2018']})
print(DataFrame_1.set_index('Group'))
print(DataFrame_2.set_index('HireDate'))
print(pd.merge(DataFrame_1.set_index('Group'),DataFrame_2.set_index('HireDate'),\
left_index=True,right_index=True,left_on='Employee',right_on='Employee',how='inner'))
>>>
Employee
Group
Accounting Jack
Engineering Sue
Engineering Suzy
HR Kim
Employee
HireDate
2020 Sue
2019 Kim
2021 Jack
2018 Suzy
Empty DataFrame
Columns: [Employee, Employee_x, Employee_y]
Index: []
所以为了避免出现这样的麻烦,我们进行数据分析前的数据预处理过程就是:
- 加载数据
- 合并数据
- 删去不需要的列
- 填充缺失值
- 设置行列索引名
外连接
与内连接相对,外连接只是返回的结果包含所有输入列,即返回输入的并集,缺失值用NaN代替
外连接需要我们指定how参数为outer
DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Sue','Kim','May','John'],'HireDate':['2020','2019','2021','2018']})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2,how='outer'))
>>>
Employee Group
0 Jack Accounting
1 Sue Engineering
2 Suzy Engineering
3 Kim HR
Employee HireDate
0 Sue 2020
1 Kim 2019
2 May 2021
3 John 2018
Employee Group HireDate
0 Jack Accounting NaN
1 Sue Engineering 2020
2 Suzy Engineering NaN
3 Kim HR 2019
4 May NaN 2021
5 John NaN 2018
左右连接
除了内外连接之外,我们还可以指定how参数为left / right,分别对应左右连接
DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Sue','Kim','May','Suzy','John','Jack'],'HireDate':['2020','2019','2021','2018','2017','2016']})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2,how='left'))
print(pd.merge(DataFrame_1,DataFrame_2,how='right'))
>>>
Employee Group
0 Jack Accounting
1 Sue Engineering
2 Suzy Engineering
3 Kim HR
Employee HireDate
0 Sue 2020
1 Kim 2019
2 May 2021
3 Suzy 2018
4 John 2017
5 Jack 2016
Employee Group HireDate
0 Jack Accounting 2016
1 Sue Engineering 2020
2 Suzy Engineering 2018
3 Kim HR 2019
Employee Group HireDate
0 Jack Accounting 2016
1 Sue Engineering 2020
2 Suzy Engineering 2018
3 Kim HR 2019
4 May NaN 2021
5 John NaN 2017
重复列名
有的时候我们需要连接的两个DataFrame对象具有重复的列名,pd.merge函数默认会为两个重复的列名自动添加后缀,但是我们可以使用suffixes参数来指定后缀
DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Sue','Kim','May','Suzy','John','Jack'],'Group':['2020','2019','2021','2018','2017','2016']})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2,on='Employee'))
>>>
Employee Group
0 Jack Accounting
1 Sue Engineering
2 Suzy Engineering
3 Kim HR
Employee Group
0 Sue 2020
1 Kim 2019
2 May 2021
3 Suzy 2018
4 John 2017
5 Jack 2016
Employee Group_x Group_y
0 Jack Accounting 2016
1 Sue Engineering 2020
2 Suzy Engineering 2018
3 Kim HR 2019
上面的例子中我们指定以Employee为键来合并两个数组,但是有两个名字同为Group的列,这时候Pandas默认会为两者添加后缀_x和_y
我们指定suffixes参数来指定后缀
DataFrame_1=pd.DataFrame({'Employee':['Jack','Sue','Suzy','Kim'],'Group':['Accounting','Engineering','Engineering','HR']})
DataFrame_2=pd.DataFrame({'Employee':['Sue','Kim','May','Suzy','John','Jack'],'Group':['2020','2019','2021','2018','2017','2016']})
print(DataFrame_1)
print(DataFrame_2)
print(pd.merge(DataFrame_1,DataFrame_2,on='Employee',suffixes=['_Left','_Right']))
>>>
Employee Group
0 Jack Accounting
1 Sue Engineering
2 Suzy Engineering
3 Kim HR
Employee Group
0 Sue 2020
1 Kim 2019
2 May 2021
3 Suzy 2018
4 John 2017
5 Jack 2016
Employee Group_Left Group_Right
0 Jack Accounting 2016
1 Sue Engineering 2020
2 Suzy Engineering 2018
3 Kim HR 2019

魔乐社区(Modelers.cn) 是一个中立、公益的人工智能社区,提供人工智能工具、模型、数据的托管、展示与应用协同服务,为人工智能开发及爱好者搭建开放的学习交流平台。社区通过理事会方式运作,由全产业链共同建设、共同运营、共同享有,推动国产AI生态繁荣发展。
更多推荐
所有评论(0)