Python – How to randomly sample from a Python list while preserving the data distribution


Essentially, what I want to do is randomly select items from a list while preserving its internal distribution. See the following example.

a = 17%
b = 12%
c = 4%
etc.

“a” has 1700 items in the list.
“b” has 1200 items in the list.
“c” has 400 items in the list.

I want a sample that simulates the distribution of a, b, c, etc. instead of using all the information.

So the ultimate goal is:

170 items randomly selected from “a”
120 items randomly selected from “b”
40 items randomly selected from “c”

I know how to randomly select information from a list, but I haven’t been able to figure out how to randomly select while forcing the results to have the same distribution.

Solution

If your list is not very large and memory is not an issue, you can use this simple method.

To get n elements from a, b, and c, you can concatenate the three lists and then use random.choice to pick random elements from the resulting list:

import random

n = 50
a = ['a'] * 170
b = ['b'] * 120
c = ['c'] * 40
big_list = a + b + c
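# Draw n elements with replacement; each draw is independent of the others
random_elements = [random.choice(big_list) for _ in range(n)]
# Example output (varies between runs):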
# ['a', 'c', 'a', 'a', 'a', 'b', 'a', 'c', 'b', 'a', 'c', 'a',
# 'a', 'a', 'a', 'b', 'b', 'a', 'a', 'a', 'a', 'a', 'c', 'a',
# 'c', 'a', 'b', 'a', 'a', 'c', 'a', 'b', 'a', 'c', 'b', 'a',
# 'a', 'b', 'a', 'b', 'a', 'a', 'c', 'a', 'c', 'a', 'b', 'c',
# 'b', 'b']

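Each draw returns an element from a with probability len(a) / len(a + b + c), so the sample follows the original distribution on average, although the exact counts per group vary from run to run.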
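If the per-group counts must be exact, as in the question (170, 120, and 40), one option is to sample each group separately with the same fraction. The following is only a sketch of that idea: the function name stratified_sample, its parameters, and the placeholder items are assumptions made for illustration, and the data is assumed to be already split by label.

```python
import random

def stratified_sample(groups, fraction, seed=None):
    """Draw the same fraction from every group, without replacement.

    `groups` maps a label to its list of items. The name and signature
    are hypothetical, chosen for this illustration.
    """
    rng = random.Random(seed)
    sample = {}
    for label, items in groups.items():
        k = round(len(items) * fraction)  # per-group sample size
        sample[label] = rng.sample(items, k)
    return sample

# Placeholder data with the group sizes from the question
groups = {
    "a": [f"a{i}" for i in range(1700)],
    "b": [f"b{i}" for i in range(1200)],
    "c": [f"c{i}" for i in range(400)],
}
sample = stratified_sample(groups, fraction=0.1, seed=0)
counts = {label: len(items) for label, items in sample.items()}
print(counts)  # {'a': 170, 'b': 120, 'c': 40}
```

Because each group is sampled on its own, the proportions of the result match the original exactly, up to rounding of the per-group sizes.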

With random.choice, the same element can be picked more than once. If you don’t want that, you can shuffle the list with random.shuffle and take the first n elements.
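As a rough sketch of sampling without replacement, reusing the lists from the example above: the first variant shuffles a copy and slices it, and the second uses random.sample from the standard library, which gives the same kind of result in one call. The variable names are illustrative.

```python
import random

a = ['a'] * 170
b = ['b'] * 120
c = ['c'] * 40
big_list = a + b + c
n = 50

# Option 1: shuffle a copy and keep the first n elements
shuffled = big_list[:]      # copy so big_list keeps its original order
random.shuffle(shuffled)    # shuffles in place and returns None
sample_shuffle = shuffled[:n]

# Option 2: random.sample picks n distinct positions in one call
sample_direct = random.sample(big_list, n)
```

In both variants no position of big_list is used twice, but the group proportions in the sample are still only approximately those of the full list.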
