Conditionally sample rows from a pandas DataFrame
I have a pandas DataFrame in which some people are overrepresented: they have far more rows than others. I want to subsample it so that each person keeps at most a certain maximum number of observations.
Right now I'm doing this in a loop and trying to build a DataFrame from a dictionary, but the indexes are getting in my way, and I would appreciate a pointer to a simpler solution. The real data has ~20K rows, ~4K columns, and ~400 people. Thank you.
Sample data.
import pandas as pd

df = pd.DataFrame({'name': ["Alice", "Alice", "Charles", "Charles", "Charles", "Kumar", "Kumar", "Kumar", "Kumar"],
                   'height': [124, 125, 169, 178, 177, 172, 173, 175, 174]})
df
height name
0 124 Alice
1 125 Alice
2 169 Charles
3 178 Charles
4 177 Charles
5 172 Kumar
6 173 Kumar
7 175 Kumar
8 174 Kumar
Here is my code, which for this example tries to limit each person to 2 rows.
sub_df = []
for name in pd.unique(df.name):
    sub_df.append(df[df.name == name].sample(n=2, random_state=42).to_dict())
pd.DataFrame(sub_df)
What I got.
height name
0 {1: 125, 0: 124} {1: 'Alice', 0: 'Alice'}
1 {2: 169, 3: 178} {2: 'Charles', 3: 'Charles'}
2 {6: 174, 8: 175} {6: 'Kumar', 8: 'Kumar'}
What I want.
height name
0 125 Alice
1 124 Alice
2 169 Charles
3 178 Charles
4 174 Kumar
5 175 Kumar
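The loop above produces cells holding dictionaries because each sampled piece is converted with to_dict() before the final DataFrame is built. A minimal sketch of the same loop that keeps each piece as a DataFrame and stitches the pieces together with pd.concat is shown here. The min(len(grp), 2) cap and the variable names parts and result are assumptions added for illustration; they are not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({'name': ["Alice", "Alice", "Charles", "Charles", "Charles",
                            "Kumar", "Kumar", "Kumar", "Kumar"],
                   'height': [124, 125, 169, 178, 177, 172, 173, 175, 174]})

# Sample each person's rows separately and keep the pieces as DataFrames.
# min(len(grp), 2) avoids asking for more rows than a group contains.
parts = [grp.sample(n=min(len(grp), 2), random_state=42)
         for _, grp in df.groupby('name')]

# ignore_index=True discards the original row labels and renumbers from 0.
result = pd.concat(parts, ignore_index=True)
print(result)
```

Passing ignore_index=True is what removes the index clash: the original row labels are dropped and the combined frame is numbered 0 to 5.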
Solution
Group by 'name' and then use sample on each group:
# groupby and sample
df = df.groupby('name').apply(lambda grp: grp.sample(n=2))
# formatting
df = df.reset_index(drop=True)
Result output:
height name
0 125 Alice
1 124 Alice
2 177 Charles
3 169 Charles
4 175 Kumar
5 173 Kumar
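One caveat with sample(n=2): if any person has fewer than 2 rows, sample raises a ValueError, because it cannot draw more rows than the group holds without replacement. Since the goal is a maximum per person rather than an exact count, a possible alternative is to shuffle the whole frame once and then keep the first few rows of each group with groupby(...).head(). The sketch below assumes that approach; the helper name cap_rows_per_group and its parameters are hypothetical and chosen only for this example.

```python
import pandas as pd

def cap_rows_per_group(df, key, max_rows, seed=None):
    """Return a random subsample with at most `max_rows` rows per value of `key`.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Shuffle all rows once; frac=1 returns every row in random order.
    shuffled = df.sample(frac=1, random_state=seed)
    # head() keeps the first max_rows rows of each group in the shuffled order,
    # and simply returns the whole group when it is smaller than max_rows.
    capped = shuffled.groupby(key).head(max_rows)
    # Put each group's rows next to each other and renumber the index.
    return capped.sort_values(key, kind="stable").reset_index(drop=True)

df = pd.DataFrame({'name': ["Alice", "Alice", "Charles", "Charles", "Charles",
                            "Kumar", "Kumar", "Kumar", "Kumar"],
                   'height': [124, 125, 169, 178, 177, 172, 173, 175, 174]})

result = cap_rows_per_group(df, 'name', 2, seed=42)
print(result)
```

Because the rows are shuffled before grouping, the rows kept for each person are a uniform random subset, and the single shuffle avoids calling a Python function once per group, which may matter with ~400 people and ~4K columns.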