Python – Normalize histograms using matplotlib

I want to plot a histogram with Matplotlib, but I want the value of each bin to represent a percentage of the total observations. My MWE looks like this:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import matplotlib.pyplot as plt
import matplotlib.ticker as tck
import seaborn as sns
import numpy

sns.set(style='dark')

imagen2 = plt.figure(1, figsize=(5, 2))
imagen2.suptitle('StackOverflow Matplotlib histogram demo')

luminance = numpy.random.randn(1000, 1000)
# "Luminance" should range from 0.0...1.0 so we normalize it
luminance = (luminance - luminance.min())/(luminance.max() - luminance.min())

top_left = plt.subplot(121)
top_left.imshow(luminance)
bottom_left = plt.subplot(122)
sns.distplot(luminance.flatten(), kde_kws={"cumulative": True})

# plt.savefig("stackoverflow.pdf", dpi=300)
plt.tight_layout(rect=(0, 0, 1, 0.95))
plt.show()

Here the CDF is OK (range: [0, 1]), but the histogram I get is not what I expected:

[Image: histogram with values out of the valid range]

Why are the results of the histogram in the range [0, 4]? Is there any way to solve this problem?

Solution

What you think you want

Here’s how to plot the histogram so that the bins sum to 1:

import matplotlib.pyplot as plt
import matplotlib.ticker as tck
import seaborn as sns
import numpy as np

sns.set(style='dark')

imagen2 = plt.figure(1, figsize=(5, 2))
imagen2.suptitle('StackOverflow Matplotlib histogram demo')

luminance = np.random.randn(1000, 1000)
# "Luminance" should range from 0.0...1.0 so we normalize it
luminance = (luminance - luminance.min())/(luminance.max() - luminance.min())

# get the histogram values
heights,edges = np.histogram(luminance.flat, bins=30)
binCenters = (edges[:-1] + edges[1:])/2

# norm the heights
heights = heights/heights.sum()

# get the cdf
cdf = heights.cumsum()

left = plt.subplot(121)
left.imshow(luminance)
right = plt.subplot(122)
right.plot(binCenters, cdf, binCenters, heights)

# plt.savefig("stackoverflow.pdf", dpi=300)
plt.tight_layout(rect=(0, 0, 1, 0.95))
plt.show()

# confirm that the hist vals sum to 1
print('heights sum: %.2f' % heights.sum())

Output:

heights sum: 1.00
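The same per-bin fractions can also be obtained in a single call by giving every observation a weight of 1/N. The following is only a minimal sketch: the variable names are illustrative, and the fixed seed is an assumption made for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, assumed only for reproducibility
luminance = rng.standard_normal((1000, 1000))
luminance = (luminance - luminance.min()) / (luminance.max() - luminance.min())

values = luminance.ravel()
# Each observation contributes 1/N, so every bin holds its fraction of the total
weights = np.full(values.size, 1.0 / values.size)
fractions, edges = np.histogram(values, bins=30, weights=weights)

# Multiply by 100 to express the bins as percentages
percentages = 100 * fractions
print('fractions sum: %.2f' % fractions.sum())
```

This avoids the separate division step, at the cost of allocating a weights array as large as the data.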

Practical answer

This one is actually super simple. Just do:

sns.distplot(luminance.flatten(), kde_kws={"cumulative": True}, norm_hist=True)

Here’s what I got when I ran the modified script above:

[Image: plot produced by the modified script]
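If the goal is bars whose heights are literally shares of all observations, plain matplotlib can probably do it without seaborn. The sketch below assumes that weighting each observation by 1/N is acceptable and uses `PercentFormatter` from `matplotlib.ticker` to label the axis. The sample size and variable names are illustrative, not taken from the question.

```python
import matplotlib.pyplot as plt
import matplotlib.ticker as tck
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, assumed only for reproducibility
values = rng.standard_normal(100_000)
values = (values - values.min()) / (values.max() - values.min())

fig, ax = plt.subplots(figsize=(4, 3))
# Weight each observation by 1/N so bar heights are fractions of the total
weights = np.full(values.size, 1.0 / values.size)
heights, edges, patches = ax.hist(values, bins=30, weights=weights)

# Label the y axis in percent; xmax=1 means a height of 1.0 is shown as 100%
ax.yaxis.set_major_formatter(tck.PercentFormatter(xmax=1))
ax.set_ylabel('share of observations')
```

Call `plt.show()` or `fig.savefig(...)` afterwards to view the result. With this weighting no bar can exceed 100%, regardless of bin width.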

Surprise twist!

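It turns out that your histogram was normalized all along. According to the seaborn documentation for distplot, norm_hist makes the histogram height show a density rather than a count, and this is implied whenever a KDE or fitted density is plotted.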

In plain English, it is common practice to normalize histograms of continuous values (i.e., values whose observations can be represented as floats) by density. In that case the bin heights themselves do not sum to 1; instead, the sum over all bins of bin width times bin height is 1.0. Because the data here span only [0, 1], the bins are narrow, so heights well above 1 are expected. You can see this by running this simplified version of the script:

import matplotlib.pyplot as plt
import matplotlib.ticker as tck
import numpy as np

imagen2 = plt.figure(1, figsize=(4,3))
imagen2.suptitle('StackOverflow Matplotlib histogram demo')

luminance = np.random.randn(1000, 1000)
luminance = (luminance - luminance.min())/(luminance.max() - luminance.min())

heights,edges,patches = plt.hist(luminance.ravel(), density=True, bins=30)
widths = edges[1:] - edges[:-1]

totalWeight = (heights*widths).sum()

# plt.savefig("stackoverflow.pdf", dpi=300)
plt.tight_layout(rect=(0, 0, 1, 0.95))
plt.show()
print(totalWeight)

And totalWeight does equal 1.0, up to a bit of floating-point rounding error.
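To connect this back to the [0, 4] range in the question: a density height is the bin's fraction of observations divided by the bin width. The following NumPy-only sketch checks that relationship on similar data. The sample size, seed, and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, assumed only for reproducibility
values = rng.standard_normal(100_000)
values = (values - values.min()) / (values.max() - values.min())

# Same bins twice: once as densities, once as raw counts
density, edges = np.histogram(values, bins=30, density=True)
counts, _ = np.histogram(values, bins=30)
widths = np.diff(edges)

fractions = counts / counts.sum()

# density = fraction / width; widths are about 1/30 here, so heights can exceed 1
print(np.allclose(density, fractions / widths))
# Multiplying the densities by the bin widths recovers the per-bin fractions
print(np.allclose(density * widths, fractions))
print(density.max() > 1.0)
```

So converting a density histogram into per-bin fractions only requires multiplying each height by its bin width.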
