Python – scipy stats binom cdf returns nan

scipy stats binom cdf returns nan… here is a solution to the problem.

scipy stats binom cdf returns nan

If I understand correctly, the CDF of a discrete distribution in scipy.stats should return the cumulative probability, i.e. the sum of the probability mass up to and including a given value.

Therefore, scipy.stats.binom(7000000000, 0.5).cdf(6999999999) should return a value of almost exactly 1, because with 7 billion trials at 50/50 odds, a result of 7 billion minus one successes or fewer is a near certainty. Instead, I got np.nan. In fact, .cdf returns np.nan for every value except 7 billion itself (or more).

What’s going on? Are there restrictions on the numbers that scipy.stats distributions can handle that aren’t mentioned in the documentation?

Solution

TL;DR

The cause is a lack of floating-point precision in the internal calculations. Although scipy is a Python library, its core is written in C and uses C numeric types.
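The precision loss can be reproduced without scipy at all: a term that is small enough relative to its neighbour is silently absorbed by addition. (The constants below are purely illustrative, not values taken from the scipy internals.)

```python
# float64 carries roughly 15-16 significant decimal digits
# (machine epsilon is about 2.2e-16), so a sufficiently small
# term simply disappears when added to 1.0
print(1.0 + 1e-10 == 1.0)   # False: 1e-10 still fits in the mantissa next to 1.0
print(1.0 + 1e-17 == 1.0)   # True: 1e-17 is lost entirely
```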


For example:

import scipy.stats

for i in range(13):
    trials = 10 ** i
    print(f"i: {i}\tprobability: {scipy.stats.binom(trials, 0.5).cdf(trials - 1)}")

The output is:

i: 0    probability: 0.5
i: 1    probability: 0.9990234375
i: 2    probability: 0.9999999999999999
i: 3    probability: 0.9999999999999999
i: 4    probability: 0.9999999999999999
i: 5    probability: 0.9999999999999999
i: 6    probability: 0.9999999999999999
i: 7    probability: 0.9999999999999999
i: 8    probability: 0.9999999999999999
i: 9    probability: 0.9999999999999999
i: 10   probability: nan
i: 11   probability: nan
i: 12   probability: nan

The reason lies in the CDF formula for the binomial distribution (I can’t embed an image, so here is the wikilink: https://en.wikipedia.org/wiki/Binomial_distribution).

In the scipy source code, we see a reference to this implementation: http://www.netlib.org/cephes/doubldoc.html#bdtr

Deep inside, the algorithm divides by the number of trials (incbet.c, line 375: ai = 1.0/a; — the variable is called a there, but never mind). If the number of trials is too large, the result of this division becomes so small that adding it to a much larger number changes nothing, because 64-bit floating-point precision runs out. Later in the computation, the code takes the logarithm of a value that should have changed but stayed at zero; log(0) is undefined, and the result surfaces as np.nan.
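For the specific numbers in the question there is also a simple sidestep that avoids the incomplete-beta code path entirely: with p = 0.5, P(X ≤ n−1) = 1 − P(X = n) = 1 − 0.5**n, which plain Python floats evaluate directly. This is a sketch for this special case only, not a general fix for the precision limits described above.

```python
n = 7_000_000_000

# P(X = n) = 0.5 ** n underflows to exactly 0.0 long before n reaches 7e9,
# so the CDF at n - 1 rounds to 1.0 in float64 -- the "almost exactly 1"
# the question expected
tail = 0.5 ** n
cdf = 1.0 - tail
print(tail, cdf)
```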

Related Problems and Solutions