Python - Random sampling within each pandas group

I have a data frame very similar to this one, but with thousands of values:

import numpy as np
import pandas as pd 

# Setup fake data.
np.random.seed([3, 1415])
df = pd.DataFrame({
    'Class': list('AAAAAAAAAABBBBBBBBBB'),
    'type': (['short']*5 + ['long']*5) *2,
    'image name': (['image01']*2  + ['image02']*2)*5,
    'Value2': np.random.random(20)})

I was able to find a way to randomly sample 2 values per image name, per Class, and per type with the following code:

df2 = df.groupby(['type', 'Class', 'image name'])[['Value2']].apply(lambda s: s.sample(min(len(s),2)))

I got the following result:

I’m looking for a way to subset the table so that one image (“image name”) is selected at random for each type and Class, keeping 2 values for each randomly selected image.

Here is an Excel example of the output I want:

Solution

IIUC, the problem is that you don’t want to group by the image name column, but if it’s not included in the groupby, you lose it.

You can create the GroupBy object first:

gb = df.groupby(['type', 'Class'])

Now you can use a list comprehension to iterate over the groups of the GroupBy object and sample one row from each:

blocks = [data.sample(n=1) for _,data in gb]

You can now concatenate the blocks to reconstruct a randomly sampled data frame:

pd.concat(blocks)

Output

   Class    Value2 image name   type
7      A  0.817744    image02   long
17     B  0.199844    image01   long
4      A  0.462691    image01  short
11     B  0.831104    image02  short
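
As a side note, newer pandas versions (1.1 and later) provide a sample method directly on the GroupBy object, which should give the same kind of result without the explicit list comprehension. Below is a minimal sketch reusing the data from the question; the random_state value is an arbitrary choice for reproducibility, not something from the original post.

```python
import numpy as np
import pandas as pd

# Same fake data as in the question.
np.random.seed([3, 1415])
df = pd.DataFrame({
    'Class': list('AAAAAAAAAABBBBBBBBBB'),
    'type': (['short'] * 5 + ['long'] * 5) * 2,
    'image name': (['image01'] * 2 + ['image02'] * 2) * 5,
    'Value2': np.random.random(20)})

# One random row per (type, Class) group. All columns, including
# 'image name', are kept. random_state=0 is an arbitrary assumption.
one_per_group = df.groupby(['type', 'Class']).sample(n=1, random_state=0)
print(one_per_group)
```

The rows keep their original index labels, so they can still be traced back to df.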

Alternatively, you can modify your code to keep the image name column in the selected columns instead of the groupby keys, like this:

df.groupby(['type', 'Class'])[['Value2','image name']].apply(lambda s: s.sample(min(len(s),2)))

                  Value2 image name
type  Class
long  A     8   0.777962    image01
            9   0.757983    image01
      B     19  0.100702    image02
            15  0.117642    image02
short A     3   0.465239    image02
            2   0.460148    image02
      B     10  0.934829    image02
            11  0.831104    image02

Edit: Keep the same image within each group

I’m not sure you can avoid an iterative process for this. You can iterate over the groups, pick one image name at random within each group, filter the group down to that image, and then randomly sample 2 rows from what remains, like this:

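import random

gb = df.groupby(['Class', 'type'])
ls = []

for index, frame in gb:
    # Pick one image name at random within this (Class, type) group.
    chosen = random.choice(frame['image name'].unique())
    subset = frame[frame['image name'] == chosen]
    # Keep up to 2 rows; min() avoids an error if the image has only 1 row.
    ls.append(subset.sample(n=min(len(subset), 2)))

pd.concat(ls)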

Output

   Class    Value2 image name   type
6      A  0.850445    image02   long
7      A  0.817744    image02   long
4      A  0.462691    image01  short
0      A  0.444939    image01  short
19     B  0.100702    image02   long
15     B  0.117642    image02   long
10     B  0.934829    image02  short
14     B  0.721535    image02  short
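
If you need this more than once, or want reproducible draws, the loop can be wrapped in a small helper. This is only a sketch of the same idea: the function name sample_one_image_per_group and its parameters (keys, image_col, n, seed) are made up for illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd

def sample_one_image_per_group(df, keys, image_col='image name', n=2, seed=None):
    """Hypothetical helper: per group, pick one image at random
    and keep up to n of its rows."""
    rng = np.random.default_rng(seed)
    parts = []
    for _, frame in df.groupby(keys):
        # Choose a single image name for this group.
        chosen = rng.choice(frame[image_col].unique())
        subset = frame[frame[image_col] == chosen]
        # Derive an integer seed so every draw depends on the one rng.
        row_seed = int(rng.integers(0, 2**32 - 1))
        parts.append(subset.sample(n=min(len(subset), n), random_state=row_seed))
    return pd.concat(parts)

# Same fake data as in the question.
np.random.seed([3, 1415])
df = pd.DataFrame({
    'Class': list('AAAAAAAAAABBBBBBBBBB'),
    'type': (['short'] * 5 + ['long'] * 5) * 2,
    'image name': (['image01'] * 2 + ['image02'] * 2) * 5,
    'Value2': np.random.random(20)})

# seed=0 is an arbitrary choice so the result can be reproduced.
result = sample_one_image_per_group(df, ['Class', 'type'], seed=0)
print(result)
```

Each (Class, type) group in result then contains rows from exactly one image, which is the property the edit above was after.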
