scipy stats binom cdf returns nan
If I understand correctly, the CDF
of the discrete distribution of scipy.stats
should return the sum of the probabilities for a given parameter value.
Therefore, scipy.stats.binom(7000000000, 0.5).cdf(6999999999)
should return a value of almost exactly 1, because out of 7 billion trials, there is a 50/50 probability of success at 7 billion minus 1 or less. Instead, I got np.nan.
In fact, for any value provided to .cdf
, I would return np.nan
except 7 billion itself (or more).
What’s going on? Are there some restrictions on the numbers that the scipy.stats
distribution can handle, but not in the documentation?
Solution
TL; Dr
Lack of floating-point precision when doing internal calculations. Although scipy is a Python library, it is written in C at its core and uses C numeric types.
For example:
import scipy.stats
for i in range (13):
trials = 10 ** i
print(f"i: {i}\tprobability: {scipy.stats.binom(trials, 0.5).cdf(trials - 1)}")
The output is:
i: 0 probability: 0.5
i: 1 probability: 0.9990234375
i: 2 probability: 0.9999999999999999
i: 3 probability: 0.9999999999999999
i: 4 probability: 0.9999999999999999
i: 5 probability: 0.9999999999999999
i: 6 probability: 0.9999999999999999
i: 7 probability: 0.9999999999999999
i: 8 probability: 0.9999999999999999
i: 9 probability: 0.9999999999999999
i: 10 probability: nan
i: 11 probability: nan
i: 12 probability: nan
The reason lies in the CDF formula for binomial distribution (I can’t embed an image, so here is the wikilink: https://en.wikipedia.org/wiki/Binomial_distribution
In the scipy source code, we see a reference to this implementation: http://www.netlib.org/cephes/doubldoc.html#bdtr
In its depths it involves partitioning by trials
(incbet.c, line 375: ai = 1.0/a;
It’s called a
, but nwm). If your trials
are too large, the result of this division will be so small that when we add this small number to another number that is not so small, it doesn’t actually change because we lack floating-point precision here (currently only 64 bits). Then, after more arithmetic, we try to get the logarithm from a number, but it is equal to zero because it does not change when it should have changed. log(0)
is undefined and is equal to np.nan.