Python – Pandas: Substitution with boolean values produces inconsistent results

Pandas: Substitution with boolean values produces inconsistent results… here is a solution to the problem.

Pandas: Substitution with boolean values produces inconsistent results

I

have a data frame that consists of check marks like x and v, and I’m replacing the boolean value with the following line:

df.replace({'v': True, 'x': False}, inplace=True)

Before running df.replace(), all columns are of type object according to df.dtypes. After replace(), all other columns are still object, and only one column is of type bool, where the value is numpy.bool_. Pycharm displays the truth value of this particular column with a red background, as shown below.

pandas boolean dataframe

Why is this happening? Isn't Object suitable for storing boolean values? Why did Pandas change the dtype of this column from object to bool? What exactly controls it, and how do I force dtype to remain as object?

Is there a reason to change all columns to pandas.np.bool, e.g. for performance reasons?

Solution

Pandas stores sequences internally as NumPy arrays. When a series has mixed types, Pandas/NumPy must make a decision: it chooses one that contains all types in the series. As a simple example, if you have a series of integers of type int and change a single value to float, your series will become float type

In this example, your series 0 and 2 have NaN values. Now NaN or np.nan is considered a float (try type(np.nan), which will return float), while True/False is considered a boolean value. The only way NumPy can store these values is to use a dtype object, which is just a bunch of pointers (much like a list).

On the other hand, your first column only has boolean values, which can be stored with the bool type. The good thing here is that because you are not using a collection of pointers, NumPy can allocate a contiguous block of memory for this array. This results in a performance advantage over the object family or list.

You can test all of the above yourself. Here are some examples:

s1 = pd. Series([True, False])
print(s1.dtype)  # bool

s2 = pd. Series([True, False, np.nan])
print(s2.dtype)  # object

s3 = pd. Series([True, False, 0, 1])
print(s3.dtype)  # object

The last example is interesting because in Python both true == 1 and false == 0 return true because bool can be thought of as int subclass of . Therefore, internally, Pandas/NumPy decided not to enforce this equality and chose one of them. As a corollary of this, it is recommended that you check the type of the series when dealing with mixed types.

Also note that when you update a value, Pandas checks the data type:

s1 = pd. Series([True, 5.4])
print(s1.dtype)  # object

s1.iloc[-1] = False
print(s1.dtype)  # bool

Related Problems and Solutions