Python – Pandas: How do I subset (and sum) the top N observations in a subcategory?

Pandas: How do I subset (and sum) the top N observations in a subcategory?… here is a solution to the problem.

Pandas: How do I subset (and sum) the top N observations in a subcategory?

In a pandas data frame like this:

     year      month  passengers
0    1949    January         112
1    1949   February         118
2    1949      March         132
3    1949      April         129
4    1949        May         121
5    1949       June         135
.
.
.
137  1960       June         535
138  1960       July         622
139  1960     August         606
140  1960  September         508
141  1960    October         461
142  1960   November         390
143  1960   December         432

How do I divide (and count) the maximum 3 months per year for passengers?

You can copy the exact same data frame from the seaborn dataset:

import pandas as pd
import seaborn as sns

df = sns.load_dataset('flights')
df

Here’s what I tried :

According to this Suggestion in post comment I found that after reindexing the dataframe, I can use nlargest() to subset the dataframe:

df = df.set_index(['year', 'month'])
df2 = df.groupby(level=0)['passengers'].nlargest(3)
df2

Output:

year  year  month    
1949  1949  July         148
            August       148
            September    136
1950  1950  July         170
            August       170

But for some reason, the year index is duplicated and I still need to re-index, group the dataframe by year and sum the results. This is starting to get confusing, so isn’t there a better way?

Here’s what a simple copy-paste is all about:

import pandas as pd
import seaborn as sns
df = sns.load_dataset('flights')
df = df.set_index(['year', 'month'])
df2 = df.groupby(level=0)['passengers'].nlargest(3)

This is the shape of the desired output:

# Sum of top 3 months for each year (no index other than default pandas dataframe index)

year    sum
0   1949    600 (the sum is made up)
1   1950    600
.
.
.
10  1960    600

Thanks for any suggestions!

Edit: System Information:

python 3.6.0

Pandas 0.19.2

Windows 7

Solution

I think < a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.apply.html" rel="noreferrer noopener nofollow" > GroupBy.apply , Series.nlargest and sum:

df = sns.load_dataset('flights')
df2 = df.groupby('year')['passengers'].apply(lambda x: x.nlargest(3).sum()).reset_index()
print (df2)
    year  passengers
0   1949         432
1   1950         498
2   1951         582
3   1952         690
4   1953         779
5   1954         859
6   1955        1026
7   1956        1192
8   1957        1354
9   1958        1431
10  1959        1579
11  1960        1763

Related Problems and Solutions