Python - Conditionally sample rows from a Pandas DataFrame

I have a pandas DataFrame in which some people are over-represented. I want to resample it so that each person has at most a certain maximum number of observations.

Right now I'm doing this in a loop and trying to build a DataFrame from a dictionary, but the indexes are getting in my way, and I hope someone can point me to a simpler solution. The real data has ~20K rows, ~4K columns, and ~400 people. Thank you.

Sample data.

df = pd.DataFrame({'name': ["Alice", "Alice", "Charles", "Charles", "Charles", "Kumar", "Kumar", "Kumar", "Kumar"],
              'height': [124, 125, 169, 178, 177, 172, 173, 175, 174]})

df
    height name
0   124 Alice
1   125 Alice
2   169 Charles
3   178 Charles
4   177 Charles
5   172 Kumar
6   173 Kumar
7   175 Kumar
8   174 Kumar

Here is my code, which for this example tries to limit each person to 2 rows.

sub_df = []
for name in pd.unique(df.name):
    sub_df.append(df[df.name == name].sample(n=2, random_state=42).to_dict())

pd.DataFrame(sub_df)

What I got.

    height               name
0   {1: 125, 0: 124}    {1: 'Alice', 0: 'Alice'}
1   {2: 169, 3: 178}    {2: 'Charles', 3: 'Charles'}
2   {6: 174, 8: 175}    {6: 'Kumar', 8: 'Kumar'}

What I want.

    height name
0   125 Alice
1   124 Alice
2   169 Charles
3   178 Charles
4   174 Kumar
5   175 Kumar

Solution

Group by 'name' and then call sample on each group:

# groupby and sample
df = df.groupby('name').apply(lambda grp: grp.sample(n=2))

# formatting
df = df.reset_index(drop=True)

Result output:

   height     name
0     125    Alice
1     124    Alice
2     177  Charles
3     169  Charles
4     175    Kumar
5     173    Kumar
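
One caveat: grp.sample(n=2) raises an error when a person has fewer than 2 rows, whereas the question asks for a maximum per person. The following is a minimal sketch of one way around that, not part of the original answer: shuffle all rows once, then keep the first few rows of each group. The helper name cap_per_group and its parameters are hypothetical, chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Alice", "Charles", "Charles", "Charles",
             "Kumar", "Kumar", "Kumar", "Kumar"],
    "height": [124, 125, 169, 178, 177, 172, 173, 175, 174],
})

def cap_per_group(frame, key, max_rows, seed=None):
    """Return at most max_rows randomly chosen rows per value of key.

    Hypothetical helper: groups with fewer than max_rows rows are kept whole.
    """
    # Shuffle every row, then keep the first max_rows rows of each group.
    shuffled = frame.sample(frac=1, random_state=seed)
    capped = shuffled.groupby(key, sort=False).head(max_rows)
    # Restore the original row order and renumber the index from 0.
    return capped.sort_index().reset_index(drop=True)

capped = cap_per_group(df, "name", max_rows=2, seed=42)
print(capped)
```

Because head() simply stops at the end of a short group, a person with a single row is kept as-is instead of causing an error. On pandas 1.1 or later, df.groupby('name').sample(n=2) is a shorter spelling of the answer above, but it fails in the same way when a group has fewer than n rows.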
