Python - pandas is equivalent to a few columns of group by and STRING_AGG

pandas is equivalent to a few columns of group by and STRING_AGG… here is a solution to the problem.

pandas is equivalent to a few columns of group by and STRING_AGG

Is there a way to group DataFrame objects and then put all entries into a collection (or a list with filtered unique values) for the remaining columns. So like this

   Name        Date  Amount   purchase
0  Jack  2016-01-31      10    'apple'
1  Jack  2016-02-29       5        25
2  Jack  2016-02-29       8       'B+'
3  Jill  2016-01-31      10       'D2'
4  Jill  2016-02-29       5         E
4  Jill  2016-02-29       5         E

Output after grouping by the first two columns.

    Name        Date  Amount     purchase
 0  Jack  2016-01-31     [10]     [apple]
 1  Jack  2016-02-29    [5,8]   [25,'B+']
 3  Jill  2016-01-31     [10]      ['D2']
 4  Jill  2016-02-29      [5]       ['E']

So I can do this for each column with df_data = df.groupby(['Name', 'Date'])['Amount'].apply(set) for each column and then concatenate them, but is there a shorter and more elegant solution if the list is long?

Solution

Don’t do it

Pandas was never designed to hold lists as series/columns. You can develop costly workarounds, but these are not recommended.

The

main reason why saving lists in series is deprecated is that you lose the vectorization ability to use NumPy arrays to save contiguous blocks of memory. Your series will be object dtype, which represents a series of pointers, much like a list.

Of course, object dtypes are unavoidable for mixed types. Here, however, nested pointer structures complicate the effect even more. You will lose memory and performance benefits, and you will not be able to access optimized Pandas methods.

Slightly better option

Can be aggregated into a string so that there is only one layer of pointers:

res = df.groupby(['Name', 'Date'], as_index=False)[['Amount', 'purchase']]\
        .agg(lambda x: ', '.join(map(str, set(x))))

print(res)
    
Name        Date Amount  purchase
0  Jack  2016-01-31     10   'apple'
1  Jack  2016-02-29   8, 5  'B+', 25
2  Jill  2016-01-31     10      'D2'
3  Jill  2016-02-29      5         E

Python – pandas is equivalent to a few columns of group by and STRING_AGG

pandas is equivalent to a few columns of group by and STRING_AGG

Solution

Don’t do it

Slightly better option

Related Problems and Solutions