pandas is equivalent to a few columns of group by and STRING_AGG
Is there a way to group DataFrame objects and then put all entries into a collection (or a list with filtered unique values) for the remaining columns. So like this
Name Date Amount purchase
0 Jack 2016-01-31 10 'apple'
1 Jack 2016-02-29 5 25
2 Jack 2016-02-29 8 'B+'
3 Jill 2016-01-31 10 'D2'
4 Jill 2016-02-29 5 E
4 Jill 2016-02-29 5 E
Output after grouping by the first two columns.
Name Date Amount purchase
0 Jack 2016-01-31 [10] [apple]
1 Jack 2016-02-29 [5,8] [25,'B+']
3 Jill 2016-01-31 [10] ['D2']
4 Jill 2016-02-29 [5] ['E']
So I can do this for each column with df_data = df.groupby(['Name', 'Date'])['Amount'].apply(set)
for each column and then concatenate them, but is there a shorter and more elegant solution if the list is long?
Solution
Don’t do it
Pandas was never designed to hold lists as series/columns. You can develop costly workarounds, but these are not recommended.
The
main reason why saving lists in series is deprecated is that you lose the vectorization ability to use NumPy arrays to save contiguous blocks of memory. Your series will be object
dtype, which represents a series of pointers, much like a list
.
Of course, object
dtypes are unavoidable for mixed types. Here, however, nested pointer structures complicate the effect even more. You will lose memory and performance benefits, and you will not be able to access optimized Pandas methods.
Slightly better option
Can be aggregated into a string so that there is only one layer of pointers:
res = df.groupby(['Name', 'Date'], as_index=False)[['Amount', 'purchase']]\
.agg(lambda x: ', '.join(map(str, set(x))))
print(res)
Name Date Amount purchase
0 Jack 2016-01-31 10 'apple'
1 Jack 2016-02-29 8, 5 'B+', 25
2 Jill 2016-01-31 10 'D2'
3 Jill 2016-02-29 5 E