How to randomly sample from a Python list while preserving the distribution
Essentially, I want to randomly select items from a list while preserving its internal distribution. See the following example.
a = 17%
b = 12%
c = 4%
etc.
“a” has 1700 items in the list.
“b” has 1200 items in the list.
“c” has 400 items in the list.
I want a sample that reproduces the distribution of a, b, c, etc. without using all of the data.
So the ultimate goal is:
170 items randomly selected from “a”
120 items randomly selected from “b”
40 items randomly selected from “c”
I know how to randomly select information from a list, but I haven’t been able to figure out how to randomly select while forcing the results to have the same distribution.
Solution
If your list is not very large and memory is not an issue, you can use this simple method.
To get n elements from a, b, and c, you can concatenate the three lists and then use random.choice to select random elements from the resulting list:
import random
n = 50
a = ['a'] * 170
b = ['b'] * 120
c = ['c'] * 40
big_list = a + b + c
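# Draw n elements; each draw picks uniformly from big_list
random_elements = [random.choice(big_list) for _ in range(n)]
# Example output (varies from run to run):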
# ['a', 'c', 'a', 'a', 'a', 'b', 'a', 'c', 'b', 'a', 'c', 'a',
# 'a', 'a', 'a', 'b', 'b', 'a', 'a', 'a', 'a', 'a', 'c', 'a',
# 'c', 'a', 'b', 'a', 'a', 'c', 'a', 'b', 'a', 'c', 'b', 'a',
# 'a', 'b', 'a', 'b', 'a', 'a', 'c', 'a', 'c', 'a', 'b', 'c',
# 'b', 'b']
Each draw has a len(a) / len(a + b + c) probability of returning an element from a.
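A quick way to check this is to count the labels in a larger draw. The sketch below is only illustrative: the seed and the 10,000 draws are arbitrary choices, not part of the original answer. Because each draw is independent, the observed shares should land close to 170/330, 120/330, and 40/330 without matching them exactly.

```python
import random
from collections import Counter

random.seed(0)  # arbitrary seed, only so the run is repeatable

a = ['a'] * 170
b = ['b'] * 120
c = ['c'] * 40
big_list = a + b + c

n_draws = 10_000  # assumed sample size, large enough to smooth out noise
draws = [random.choice(big_list) for _ in range(n_draws)]
counts = Counter(draws)

# Compare the expected share of each label with the share actually drawn
for label, group in (('a', a), ('b', b), ('c', c)):
    expected = len(group) / len(big_list)
    observed = counts[label] / n_draws
    print(f"{label}: expected {expected:.3f}, observed {observed:.3f}")
```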
Note that random.choice samples with replacement, so you may get the same element multiple times. If you don’t want this to happen, you can use random.shuffle.
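A minimal sketch of the shuffle approach, assuming you shuffle a copy of the list and keep the first n elements. The helper names sample_without_replacement and sample_per_group are made up for this example. The second helper applies the same idea to each group separately, which yields exact per-group counts like the 170/120/40 in the question; the 10% fraction and the rounding rule are assumptions.

```python
import random

def sample_without_replacement(items, n):
    """Return up to n items in random order, each list position used at most once."""
    shuffled = list(items)      # copy so the caller's list is left untouched
    random.shuffle(shuffled)    # shuffles the copy in place
    return shuffled[:n]

def sample_per_group(groups, fraction):
    """Take the same fraction of every group, so the proportions are preserved."""
    result = {}
    for label, items in groups.items():
        k = round(len(items) * fraction)  # assumed rounding rule
        result[label] = sample_without_replacement(items, k)
    return result

# Hypothetical data matching the group sizes in the question
groups = {
    'a': [f'a{i}' for i in range(1700)],
    'b': [f'b{i}' for i in range(1200)],
    'c': [f'c{i}' for i in range(400)],
}
picked = sample_per_group(groups, 0.10)
print({label: len(items) for label, items in picked.items()})
# {'a': 170, 'b': 120, 'c': 40}
```

The standard library's random.sample(items, n) performs the same without-replacement draw in a single call, although it raises ValueError when n exceeds the list length, whereas the slice above silently returns fewer items.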