Is it possible to get the groupby style count without collapsing the Pandas DataFrame?
I have a DataFrame with 9 columns and I'm trying to add a column of unique value counts based on the first 3 columns (e.g. columns A, B, and C must all match to count as a duplicate, but the remaining columns may differ). I tried to do this the way I would with groupby:
df = pd.DataFrame(resultsFile500.groupby(['chr','start','end']).size().reset_index().rename(columns={0:'count'}))
This returns a DataFrame with 4 columns, and the count is what I want. However, I also need the values from the original DataFrame, so what I've been trying to do is somehow make these count values a column in the original df. That is, if two rows have the same values in the chr, start, and end columns, the count column should be 2 in both rows, but the rows should not be collapsed into one. Is there an easy solution here that I'm missing, or do I need to hack something together?
Solution
You can use .transform to get the non-collapsing behavior:
>>> df
a b c d e
0 3 4 1 3 0
1 3 1 4 3 0
2 4 3 3 2 1
3 3 4 1 4 0
4 0 4 3 3 2
5 1 2 0 4 1
6 3 1 4 2 1
7 0 4 3 4 0
8 1 3 0 1 1
9 3 4 1 2 1
>>> df.groupby(['a','b','c']).transform('count')
d e
0 3 3
1 2 2
2 1 1
3 3 3
4 2 2
5 1 1
6 2 2
7 2 2
8 1 1
9 3 3
Note that every column of the .transform result holds the same count, so you can select any one of them and assign it back:
>>> df['unique_count'] = df.groupby(['a','b','c']).transform('count')['d']
>>> df
a b c d e unique_count
0 3 4 1 3 0 3
1 3 1 4 3 0 2
2 4 3 3 2 1 1
3 3 4 1 4 0 3
4 0 4 3 3 2 2
5 1 2 0 4 1 1
6 3 1 4 2 1 2
7 0 4 3 4 0 2
8 1 3 0 1 1 1
9 3 4 1 2 1 3
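As a side note, you can skip the step of picking an arbitrary column from the .transform result by selecting a single column before transforming and using 'size' (which counts rows per group, including NaN values, whereas 'count' counts only non-null values). A minimal sketch with made-up data:

```python
import pandas as pd

# Small example frame; rows 0 and 3 share the same (a, b, c) key
df = pd.DataFrame({
    'a': [3, 3, 4, 3],
    'b': [4, 1, 3, 4],
    'c': [1, 4, 3, 1],
    'd': [3, 3, 2, 4],
})

# Select one column first so transform returns a Series aligned
# to the original index; 'size' gives the group size per row
df['unique_count'] = df.groupby(['a', 'b', 'c'])['a'].transform('size')

print(df)
```

Here rows 0 and 3 get a count of 2 and the other rows get 1, without collapsing the frame.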