Python – Pandas: How do I subset (and sum) the top N observations in a subcategory?

In a pandas data frame like this:

     year      month  passengers
0    1949    January         112
1    1949   February         118
2    1949      March         132
3    1949      April         129
4    1949        May         121
5    1949       June         135
137  1960       June         535
138  1960       July         622
139  1960     August         606
140  1960  September         508
141  1960    October         461
142  1960   November         390
143  1960   December         432

How do I subset (and sum) the 3 months with the most passengers in each year?

You can copy the exact same data frame from the seaborn dataset:

import pandas as pd
import seaborn as sns

df = sns.load_dataset('flights')

Here’s what I tried:

Following a suggestion in a comment on another post, I found that after setting year and month as the index, I can use nlargest() to subset the dataframe:

df = df.set_index(['year', 'month'])
df2 = df.groupby(level=0)['passengers'].nlargest(3)


year  year  month    
1949  1949  July         148
            August       148
            September    136
1950  1950  July         170
            August       170

But for some reason the year index level is duplicated, and I would still need to reset the index, group the dataframe by year, and sum the results. This is starting to get confusing, so isn’t there a better way?
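A note on the duplicated level: as far as I can tell, the groupby prepends the group key (year) as a new outer level while the original (year, month) index is kept, which is why year shows up twice. Either way, level 0 of the result is the year, so the attempt above can be finished by grouping on that level again. The sketch below assumes a small hand-made frame instead of the seaborn data (the values and the names indexed, top3 and totals are only illustrative):

```python
import pandas as pd

# Small hand-made stand-in for the flights data (illustrative values only).
df = pd.DataFrame({
    "year": [1949] * 4 + [1950] * 4,
    "month": ["January", "February", "March", "April"] * 2,
    "passengers": [112, 118, 132, 129, 115, 126, 141, 135],
})

indexed = df.set_index(["year", "month"])
top3 = indexed.groupby(level=0)["passengers"].nlargest(3)

# Level 0 of the result is the year, so grouping on it again and summing
# gives one total per year; reset_index turns it back into a plain frame.
totals = top3.groupby(level=0).sum().reset_index(name="sum")
print(totals)
```

Grouping by position (level=0) rather than by name sidesteps the fact that two levels are both called year.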

Here is the complete snippet for easy copy-paste:

import pandas as pd
import seaborn as sns
df = sns.load_dataset('flights')
df = df.set_index(['year', 'month'])
df2 = df.groupby(level=0)['passengers'].nlargest(3)

This is the shape of the desired output:

# Sum of top 3 months for each year (no index other than default pandas dataframe index)

year    sum
0   1949    600 (the sum is made up)
1   1950    600
10  1960    600

Thanks for any suggestions!

Edit: System Information:

python 3.6.0

Pandas 0.19.2

Windows 7


I think you need GroupBy.apply with Series.nlargest and sum:

import pandas as pd
import seaborn as sns

df = sns.load_dataset('flights')
df2 = df.groupby('year')['passengers'].apply(lambda x: x.nlargest(3).sum()).reset_index()
print(df2)
    year  passengers
0   1949         432
1   1950         498
2   1951         582
3   1952         690
4   1953         779
5   1954         859
6   1955        1026
7   1956        1192
8   1957        1354
9   1958        1431
10  1959        1579
11  1960        1763
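If you would rather avoid apply with a lambda, one possible alternative (not part of the answer above, just a sketch) is to sort by passengers, keep the first 3 rows of each year with GroupBy.head, and then sum per year. The example assumes a small hand-made frame rather than the seaborn data, and renames the result column to sum to match the shape asked for in the question:

```python
import pandas as pd

# Small hand-made stand-in for the flights data (illustrative values only).
df = pd.DataFrame({
    "year": [1949] * 4 + [1950] * 4,
    "month": ["January", "February", "March", "April"] * 2,
    "passengers": [112, 118, 132, 129, 115, 126, 141, 135],
})

# After sorting in descending order, head(3) keeps the 3 largest rows per year.
top3 = df.sort_values("passengers", ascending=False).groupby("year").head(3)

# Sum those rows per year; as_index=False keeps year as an ordinary column.
totals = (
    top3.groupby("year", as_index=False)["passengers"]
    .sum()
    .rename(columns={"passengers": "sum"})
)
print(totals)
```

Ties at the cut-off do not change the total, since tied rows carry the same passenger count whichever of them is kept.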
