Compare 2 columns in different data frames with primary key conditions without merging… here is a solution to the problem.
Compare 2 columns in different data frames with primary key conditions without merging
I have 2 different data frames, for example:
Df1:
User_id User_name User_phn
1 Alex 1234123
2 Danny 4234123
3 Bryan 5234123
Df2:
User_id User_name User_phn
1 Alex 3234123
2 Chris 4234123
3 Bryan 5234123
4 Bexy 6234123
user_id is the primary key in both tables, and I need to compare two dataframes using user_id as a condition and get values with matching and mismatched values without merging the dataframes into a new dataframe. We’re going to be dealing with over 100 million records with huge datasets, which is why I don’t want to merge into a new data frame again, which I think consumes memory again.
Result:
User_id User_name User_phn
1 Alex Mismatch
2 Mismatch 4234123
3 Bryan 5234123
4 Mismatch Mismatch
Solution
Not easy, but you can pass by comparing the series
of tuples created by the combination of columns isin
to compare:
s11 = pd. Series(list(map(tuple, Df1[['User_id','User_name']].values.tolist())))
s12 = pd. Series(list(map(tuple, Df2[['User_id','User_name']].values.tolist())))
s21 = pd. Series(list(map(tuple, Df1[['User_id','User_phn']].values.tolist())))
s22 = pd. Series(list(map(tuple, Df2[['User_id','User_phn']].values.tolist())))
Df2.loc[~s12.isin(s11), 'User_name'] = 'Mismatch'
Df2.loc[~s22.isin(s21), 'User_phn'] = 'Mismatch'
print (Df2)
User_id User_name User_phn
0 1 Alex Mismatch
1 2 Mismatch 4234123
2 3 Bryan 5234123
3 4 Mismatch Mismatch
Solution merge
>isna test mismatched pair (missing value):
s1 = Df2.merge(Df1, how='left', on=['User_id','User_name'], suffixes=('_',''))['User_phn']
print (s1)
0 1234123.0
1 NaN
2 5234123.0
3 NaN
Name: User_phn, dtype: float64
s2 = Df2.merge(Df1, how='left', on=['User_id','User_phn'], suffixes=('_',''))['User_name']
print (s2)
0 NaN
1 Danny
2 Bryan
3 NaN
Name: User_name, dtype: object
Df2.loc[s1.isna(), 'User_name'] = 'Mismatch'
Df2.loc[s2.isna(), 'User_phn'] = 'Mismatch'
print (Df2)
User_id User_name User_phn
0 1 Alex Mismatch
1 2 Mismatch 4234123
2 3 Bryan 5234123
3 4 Mismatch Mismatch