pandas join DF – merge and join different semantics
I want to include 2 DFs in pandas. Some columns are int or float, others are categories. (Do not perform the same cat code/index for categories in A and B df)
Their public columns are lists of float and category columns of size 8.
How to join
df_a.merge(df_b, how='inner'), on=join_columns )
No results are returned at all. And by
joining
df_a.join(df_b, lsuffix='_l', rsuffix='_r')
Seems to work.
But I’m a bit confused as to why it failed, which could be wrong if I shouldn’t convert all columns to objects to prevent joining via cat code.
That is, if you select left
as the join method for merge
, the joined columns will contain only NAN
values. Unfortunately, I’m not quite sure how to build a useful minimal example.
Edit
Here is an example
import pandas as pd
raw_data = {
'subject_id': ['1', '2', '3', '4', '5'],
'name': ['A', 'B', 'C', 'D', 'E'],
'nationality': ['DE', 'AUT', 'US', 'US', 'US'],
'age_group' : [1, 2, 1, 3, 1]}
df_a = pd. DataFrame(raw_data, columns = ['subject_id', 'name', 'nationality', 'age_group'])
df_a.nationality = df_a.nationality.astype('category')
df_a
raw_data = {
'subject_id': ['1', '2', '3' ],
'name': ['Billy', 'Brian', 'Bran'],
'nationality': ['DE', 'US', 'US'],
'age_group' : [1, 1, 3],
'average_return_per_group' : [1.5, 2.3, 1.4]}
df_b = pd. DataFrame(raw_data, columns = ['subject_id', 'name', 'nationality', 'age_group', 'average_return_per_group'])
df_b.nationality = df_b.nationality.astype('category')
df_b
# some result is joined
df_a.join(df_b, lsuffix='_l', rsuffix='_r')
# this *fails* as only NULL values joined, or nor result for inner join
df_a.merge(df_b, how='left', on=['nationality', 'age_group'])
Solution
By default, join
is joined along the index, and merge
is along columns with the same name.
Check this:
In [115]: df_a.join(df_b, lsuffix='_l', rsuffix='_r')
Out[115]:
subject_id_l name_l nationality_l age_group_l subject_id_r name_r nationality_r age_group_r average_returns_per_group
0 1 A DE 1 1 Billy DE 1.0 NaN
1 2 B AUT 2 2 Brian US 1.0 NaN
2 3 C US 1 3 Bran US 3.0 NaN
3 4 D US 3 NaN NaN NaN NaN NaN
4 5 E US 1 NaN NaN NaN NaN NaN
Let’s set [‘a’,’b','c']
as the index in df_b
and try to join again – you’ll only see NaN
in total*_r
columns:
In [116]: df_a.join(df_b.set_index(pd. Index(['a','b','c'])), lsuffix='_l', rsuffix='_r')
Out[116]:
subject_id_l name_l nationality_l age_group_l subject_id_r name_r nationality_r age_group_r average_returns_per_group
0 1 A DE 1 NaN NaN NaN NaN NaN
1 2 B AUT 2 NaN NaN NaN NaN NaN
2 3 C US 1 NaN NaN NaN NaN NaN
3 4 D US 3 NaN NaN NaN NaN NaN
4 5 E US 1 NaN NaN NaN NaN NaN
In [117]: df_b.set_index(pd. Index(['a','b','c']))
Out[117]:
subject_id name nationality age_group average_returns_per_group
a 1 Billy DE 1 NaN
b 2 Brian US 1 NaN
c 3 Bran US 3 NaN
Update: IMO merge Works as expected (described in the documentation).
In [151]: df_a.merge(df_b, on=['nationality', 'age_group'], how='left', suffixes=['_l','_r'])
Out[151]:
subject_id_l name_l nationality age_group subject_id_r name_r average_return_per_group
0 1 A DE 1 1 Billy 1.5
1 2 B AUT 2 NaN NaN NaN
2 3 C US 1 2 Brian 2.3
3 4 D US 3 3 Bran 1.4
4 5 E US 1 2 Brian 2.3