Python – pandas join DF – merge and join different semantics

pandas join DF – merge and join different semantics… here is a solution to the problem.

pandas join DF – merge and join different semantics

I want to include 2 DFs in pandas. Some columns are int or float, others are categories. (Do not perform the same cat code/index for categories in A and B df)
Their public columns are lists of float and category columns of size 8.

How to join

df_a.merge(df_b, how='inner'), on=join_columns )

No results are returned at all. And by

joining

df_a.join(df_b, lsuffix='_l', rsuffix='_r')

Seems to work.

But I’m a bit confused as to why it failed, which could be wrong if I shouldn’t convert all columns to objects to prevent joining via cat code.

That is, if you select left as the join method for merge, the joined columns will contain only NAN values. Unfortunately, I’m not quite sure how to build a useful minimal example.

Edit

Here is an example

import pandas as pd

raw_data = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'name': ['A', 'B', 'C', 'D', 'E'],
        'nationality': ['DE', 'AUT', 'US', 'US', 'US'],
        'age_group' : [1, 2, 1, 3, 1]}
df_a = pd. DataFrame(raw_data, columns = ['subject_id', 'name', 'nationality', 'age_group'])
df_a.nationality = df_a.nationality.astype('category')
df_a

raw_data = {
        'subject_id': ['1', '2', '3' ],
        'name': ['Billy', 'Brian', 'Bran'],
        'nationality': ['DE', 'US', 'US'],
        'age_group' : [1, 1, 3],
        'average_return_per_group' : [1.5, 2.3, 1.4]}
df_b = pd. DataFrame(raw_data, columns = ['subject_id', 'name', 'nationality', 'age_group', 'average_return_per_group'])
df_b.nationality = df_b.nationality.astype('category')
df_b

# some result is joined
df_a.join(df_b, lsuffix='_l', rsuffix='_r') 

# this *fails* as only NULL values joined, or nor result for inner join
df_a.merge(df_b, how='left', on=['nationality', 'age_group'])

Solution

By default, join is joined along the index, and merge is along columns with the same name.

Check this:

In [115]: df_a.join(df_b, lsuffix='_l', rsuffix='_r')
Out[115]:
  subject_id_l name_l nationality_l  age_group_l subject_id_r name_r nationality_r  age_group_r average_returns_per_group
0            1      A            DE            1            1  Billy            DE          1.0                       NaN
1            2      B           AUT            2            2  Brian            US          1.0                       NaN
2            3      C            US            1            3   Bran            US          3.0                       NaN
3            4      D            US            3          NaN    NaN           NaN          NaN                       NaN
4            5      E            US            1          NaN    NaN           NaN          NaN                       NaN

Let’s set [‘a’,’b','c'] as the index in df_b and try to join again – you’ll only see NaN in total*_r columns:

In [116]: df_a.join(df_b.set_index(pd. Index(['a','b','c'])), lsuffix='_l', rsuffix='_r')
Out[116]:
  subject_id_l name_l nationality_l  age_group_l subject_id_r name_r nationality_r  age_group_r average_returns_per_group
0            1      A            DE            1          NaN    NaN           NaN          NaN                       NaN
1            2      B           AUT            2          NaN    NaN           NaN          NaN                       NaN
2            3      C            US            1          NaN    NaN           NaN          NaN                       NaN
3            4      D            US            3          NaN    NaN           NaN          NaN                       NaN
4            5      E            US            1          NaN    NaN           NaN          NaN                       NaN

In [117]: df_b.set_index(pd. Index(['a','b','c']))
Out[117]:
  subject_id   name nationality  age_group average_returns_per_group
a          1  Billy          DE          1                       NaN
b          2  Brian          US          1                       NaN
c          3   Bran          US          3                       NaN

Update: IMO merge Works as expected (described in the documentation).

In [151]: df_a.merge(df_b, on=['nationality', 'age_group'], how='left', suffixes=['_l','_r'])
Out[151]:
  subject_id_l name_l nationality  age_group subject_id_r name_r  average_return_per_group
0            1      A          DE          1            1  Billy                       1.5
1            2      B         AUT          2          NaN    NaN                       NaN
2            3      C          US          1            2  Brian                       2.3
3            4      D          US          3            3   Bran                       1.4
4            5      E          US          1            2  Brian                       2.3

Related Problems and Solutions