pd.qcut returns a negative value
This is a simple data sample series:
sample
Out[2]:
0 0.047515
1 0.026392
2 0.024652
3 0.022854
4 0.020397
5 0.000087
6 0.000087
7 0.000078
8 0.000078
9 0.000078
The minimum value is 0.000078 and the maximum value is 0.047515.
When I use the qcut function on it, the lowest category gets a negative lower bound.
pd.qcut(sample, 4)
Out[31]:
0 (0.0242, 0.0475]
1 (0.0242, 0.0475]
2 (0.0242, 0.0475]
3 (0.0102, 0.0242]
4 (0.0102, 0.0242]
5 (8.02e-05, 0.0102]
6 (8.02e-05, 0.0102]
7 (-0.000922, 8.02e-05]
8 (-0.000922, 8.02e-05]
9 (-0.000922, 8.02e-05]
Name: data, dtype: category
Categories (4, interval[float64]): [(-0.000922, 8.02e-05] < (8.02e-05, 0.0102] < (0.0102, 0.0242] < (0.0242, 0.0475]]
Is this expected behavior? I thought I would find my minimum and maximum as the lower and upper limits of my category.
(I'm using pandas 0.22.0 and Python 2.7.)
Solution
This happens because the binning process subtracts .001 from the lowest value in your range. The intervals are open on the left and closed on the right, so if the lowest edge were exactly equal to the minimum of the series, it would not be clear which bin that value should be put in. Therefore, it makes sense to adjust the lowest edge slightly before creating the quantile bins.
See lines 210-213 in the pd.cut source code. https://github.com/pandas-dev/pandas/blob/v0.23.4/pandas/core/reshape/tile.py#L210-L213
0.000078 -.001
Out[21]: -0.0009220000000000001
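If you need the real minimum and maximum as limits, one option is to ask qcut for the edges it computed. The following is a minimal sketch, assuming a reasonably recent pandas; the names binned, edges and first_bin are only illustrative. It rebuilds the series from the question, checks that the minimum still lands in the first (right-closed) interval, and reads the unadjusted edges back with retbins=True.

```python
import pandas as pd

# Sample values copied from the question.
sample = pd.Series(
    [0.047515, 0.026392, 0.024652, 0.022854, 0.020397,
     0.000087, 0.000087, 0.000078, 0.000078, 0.000078],
    name="data",
)

# retbins=True also returns the quantile edges that were computed.
binned, edges = pd.qcut(sample, 4, retbins=True)

# First (lowest) interval of the resulting categorical.
first_bin = binned.cat.categories[0]

# The displayed left edge sits below the real minimum, so the minimum
# still falls inside the right-closed interval.
print(first_bin.left < sample.min())
print(sample.min() in first_bin)

# The returned edges start at the minimum and end at the maximum.
print(edges[0], edges[-1])
```

So the negative number only appears in the interval label. The returned edges can be used wherever the actual limits of the data are needed.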