Find common values in columns that contain a list of items
I have a dataset with several columns that are a list of items. I have given an example below. I’m trying to find an entry in the list that has a 100% match. I want to find those with 90% or less.
>>> df2 = pd. DataFrame({ 'ID':['1', '2', '3', '4', '5', '6', '7', '8'], 'Productdetailed': [['Phone', 'Watch', 'Pen'], ['Pencil', 'fork', 'Eraser'], ['Apple', 'Mango', 'Orange'], ['Something', ' Nothing', 'Everything'], ['Eraser', 'fork', 'Pencil'], ['Phone', 'Watch', 'Pen'],['Apple', 'Mango'], ['Pen', 'Phone', 'Watch']]})
>>> df2
ID Productdetailed
0 1 [Phone, Watch, Pen]
1 2 [Pencil, fork, Eraser]
2 3 [Apple, Mango, Orange]
3 4 [Something, Nothing, Everything]
4 5 [Eraser, fork, Pencil]
5 6 [Phone, Watch, Pen]
6 7 [Apple, Mango]
7 8 [Pen, Phone, Watch]
If you notice index 0 and index 7 in DF2
, they have the same set of items but in different order. where index 0 and index 5 have the same items in the same order. I want to treat them as matches. I tried groupby
and series.isin().
I also tried intersecting by splitting the dataset into two parts, but it failed with the wrong type.
First, I want to count the number of exact matches (number of matching rows) and the row index number it matches. However, when items such as index 2 and index 6 in DF2 match only partially. I want to say the percentage of matches and the corresponding column number.
I mentioned it. I’m trying to split the data for a specific column value into two parts. And then
applied df2['Intersection'] =
[list(set(a).intersection(set(b)))
for a, b in zip(df2_part1. Productdetailed, df2_part2. Productdetailed)
]
, where a
and b
are fragments of df2_part1
and in the
Productdetailed
column df2_part2
Is there a way to do this? Please help
Solution
To know the exact match:
df2["Productdetailed"]=df2["Productdetailed"].sort_values()
# create new colum from the sorted list. More easy to work with pivot table
df2['Productdetailed_str'] = df2['Productdetailed'].apply(lambda x: ', '.join(x))
df2["hit"] = 1
df3 = (df2.pivot_table(index=["Productdetailed_str"],
values=["ID", "hit"],
aggfunc={'ID': lambda x: ', '.join(x), 'hit': 'sum'}
))
A hit is the number of occurrences.
Result df3:
ID hit
Productdetailed_str
Apple, Mango 7 1
Apple, Mango, Orange 3 1
Eraser, fork, Pencil 5 1
Pen, Phone, Watch 8 1
Pencil, fork, Eraser 2 1
Phone, Watch, Pen 1, 6 2
Something, Nothing, Everything 4 1
Partial matching is difficult, but you can start splitting the list and using a pivot table:
test = df2.apply(lambda x: pd. Series(x['Productdetailed']),axis=1).stack().reset_index(level=1, drop=True).to_frame(name='list').join(df2)
If you run the test. You have words in the Productdetailed Columns list in List Columns. Also, you have the ID… So I think you can extract the information using pivot tables….