Python – Find common values in columns that contain a list of items

Find common values in columns that contain a list of items… here is a solution to the problem.

Find common values in columns that contain a list of items

I have a dataset with several columns that are a list of items. I have given an example below. I’m trying to find an entry in the list that has a 100% match. I want to find those with 90% or less.

>>> df2 = pd. DataFrame({ 'ID':['1', '2', '3', '4', '5', '6', '7', '8'], 'Productdetailed': [['Phone', 'Watch', 'Pen'], ['Pencil', 'fork', 'Eraser'], ['Apple', 'Mango', 'Orange'], ['Something', ' Nothing', 'Everything'], ['Eraser', 'fork', 'Pencil'], ['Phone', 'Watch', 'Pen'],['Apple', 'Mango'], ['Pen', 'Phone', 'Watch']]})

>>> df2
ID                   Productdetailed
0  1               [Phone, Watch, Pen]
1  2            [Pencil, fork, Eraser]
2  3            [Apple, Mango, Orange]
3  4  [Something, Nothing, Everything]
4  5            [Eraser, fork, Pencil]
5  6               [Phone, Watch, Pen]
6  7                    [Apple, Mango]
7  8               [Pen, Phone, Watch]

If you notice index 0 and index 7 in DF2, they have the same set of items but in different order. where index 0 and index 5 have the same items in the same order. I want to treat them as matches. I tried groupby and series.isin(). I also tried intersecting by splitting the dataset into two parts, but it failed with the wrong type.

First, I want to count the number of exact matches (number of matching rows) and the row index number it matches. However, when items such as index 2 and index 6 in DF2 match only partially. I want to say the percentage of matches and the corresponding column number.

I mentioned it. I’m trying to split the data for a specific column value into two parts. And then

applied df2['Intersection'] = 
     [list(set(a).intersection(set(b))) 
         for a, b in zip(df2_part1. Productdetailed, df2_part2. Productdetailed)
     ]

, where a and b are fragments of df2_part1 and in the Productdetailed column df2_part2

Is there a way to do this? Please help

Solution

To know the exact match:

df2["Productdetailed"]=df2["Productdetailed"].sort_values()
# create new colum from the sorted list. More easy to work with pivot table
df2['Productdetailed_str'] = df2['Productdetailed'].apply(lambda x: ', '.join(x))
df2["hit"] = 1
df3 = (df2.pivot_table(index=["Productdetailed_str"],
                 values=["ID", "hit"],
                aggfunc={'ID': lambda x: ', '.join(x), 'hit': 'sum'}
               ))

A hit is the number of occurrences.
Result df3:

                                  ID  hit
Productdetailed_str                      
Apple, Mango                       7    1
Apple, Mango, Orange               3    1
Eraser, fork, Pencil               5    1
Pen, Phone, Watch                  8    1
Pencil, fork, Eraser               2    1
Phone, Watch, Pen               1, 6    2
Something, Nothing, Everything     4    1

Partial matching is difficult, but you can start splitting the list and using a pivot table:

test = df2.apply(lambda x: pd. Series(x['Productdetailed']),axis=1).stack().reset_index(level=1, drop=True).to_frame(name='list').join(df2)

If you run the test. You have words in the Productdetailed Columns list in List Columns. Also, you have the ID… So I think you can extract the information using pivot tables….

Related Problems and Solutions