Python – Compare 2 columns in different data frames with primary key conditions without merging

Compare 2 columns in different data frames with primary key conditions without merging… here is a solution to the problem.

Compare 2 columns in different data frames with primary key conditions without merging

I have 2 different data frames, for example:

Df1:

User_id    User_name     User_phn
1          Alex          1234123
2          Danny         4234123
3          Bryan         5234123

Df2:

User_id    User_name     User_phn
1          Alex          3234123
2          Chris         4234123
3          Bryan         5234123
4          Bexy          6234123

user_id is the primary key in both tables, and I need to compare two dataframes using user_id as a condition and get values with matching and mismatched values without merging the dataframes into a new dataframe. We’re going to be dealing with over 100 million records with huge datasets, which is why I don’t want to merge into a new data frame again, which I think consumes memory again.

Result:

User_id    User_name     User_phn
1          Alex          Mismatch
2          Mismatch      4234123
3          Bryan         5234123
4          Mismatch      Mismatch

Solution

Not easy, but you can pass by comparing the series of tuples created by the combination of columns isin to compare:

s11 = pd. Series(list(map(tuple, Df1[['User_id','User_name']].values.tolist())))
s12 = pd. Series(list(map(tuple, Df2[['User_id','User_name']].values.tolist())))

s21 = pd. Series(list(map(tuple, Df1[['User_id','User_phn']].values.tolist())))
s22 = pd. Series(list(map(tuple, Df2[['User_id','User_phn']].values.tolist())))

Df2.loc[~s12.isin(s11), 'User_name'] = 'Mismatch'
Df2.loc[~s22.isin(s21), 'User_phn'] = 'Mismatch'

print (Df2)
   User_id User_name  User_phn
0        1      Alex  Mismatch
1        2  Mismatch   4234123
2        3     Bryan   5234123
3        4  Mismatch  Mismatch

Solution merge >isna test mismatched pair (missing value):

s1 = Df2.merge(Df1, how='left', on=['User_id','User_name'], suffixes=('_',''))['User_phn']
print (s1)
0    1234123.0
1          NaN
2    5234123.0
3          NaN
Name: User_phn, dtype: float64

s2 = Df2.merge(Df1, how='left', on=['User_id','User_phn'], suffixes=('_',''))['User_name']
print (s2)
0      NaN
1    Danny
2    Bryan
3      NaN
Name: User_name, dtype: object

Df2.loc[s1.isna(), 'User_name'] = 'Mismatch'
Df2.loc[s2.isna(), 'User_phn'] = 'Mismatch'

print (Df2)
   User_id User_name  User_phn
0        1      Alex  Mismatch
1        2  Mismatch   4234123
2        3     Bryan   5234123
3        4  Mismatch  Mismatch

Related Problems and Solutions