Pandas: Substitution with boolean values produces inconsistent results
I
have a data frame that consists of check marks like x
and v
, and I’m replacing the boolean value with the following line:
df.replace({'v': True, 'x': False}, inplace=True)
Before running df.replace(),
all columns are of type object
according to df.dtypes.
After replace(),
all other columns are still object
, and only one column is of type bool, where the value is numpy.bool_
. Pycharm displays the truth value of this particular column with a red background, as shown below.
Why is this happening? Isn't Object
suitable for storing boolean values? Why did Pandas
change the dtype
of this column from object
to bool
? What exactly controls it, and how do I force dtype to remain as object
?
Is there a reason to change all columns to pandas.np.bool, e.g
. for performance reasons?
Solution
Pandas stores sequences internally as NumPy arrays. When a series has mixed types, Pandas/NumPy must make a decision: it chooses one that contains all types in the series. As a simple example, if you have a series of integers of type int
and change a single value to float, your series will become
float
type
In this example, your series 0 and 2 have NaN
values. Now NaN
or np.nan is considered a
float (try type(np.nan
), which will return float
), while True
/False
is considered a boolean value.
The only way NumPy can store these values is to use a dtype object
, which is just a bunch of pointers (much like a list).
On the other hand, your first column only has boolean values, which can be stored with the bool
type. The good thing here is that because you are not using a collection of pointers, NumPy can allocate a contiguous block of memory for this array. This results in a performance advantage over the object
family or list
.
You can test all of the above yourself. Here are some examples:
s1 = pd. Series([True, False])
print(s1.dtype) # bool
s2 = pd. Series([True, False, np.nan])
print(s2.dtype) # object
s3 = pd. Series([True, False, 0, 1])
print(s3.dtype) # object
The last example is interesting because in Python both true == 1
and false == 0
return true
because bool
can be thought of as int
subclass of . Therefore, internally, Pandas/NumPy decided not to enforce this equality and chose one of them. As a corollary of this, it is recommended that you check the type of the series when dealing with mixed types.
Also note that when you update a value, Pandas checks the data type:
s1 = pd. Series([True, 5.4])
print(s1.dtype) # object
s1.iloc[-1] = False
print(s1.dtype) # bool