Strange null-checking behavior with pd.notnull
This is essentially a rehash of the content of my answer here.
I’m getting some weird behavior when trying to solve this question, using pd.notnull.
Consider
x = ('A4', np.nan)
I want to check which of these items are null. Using np.isnan directly throws a TypeError (but I’ve found a workaround).
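For reference, here is a minimal sketch that reproduces the TypeError. It assumes only that numpy is imported as np; the variable name raised is just for illustration and is not part of the original question.

```python
import numpy as np

x = ('A4', np.nan)

# np.isnan first coerces the tuple to an array; because of the string
# element, that array has a string dtype, which np.isnan does not support.
try:
    np.isnan(x)
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)
```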
Using pd.notnull does not work either.
>>> pd.notnull(x)
True
It treats the tuple as a single value (rather than as an iterable of values). Converting it to a list first and then testing also gives the wrong answer.
>>> pd.notnull(list(x))
array([ True, True])
Since the second value is nan, the result I am looking for should be [True, False].
Converting to a Series first finally works:
>>> pd.Series(x).notnull()
0 True
1 False
dtype: bool
Therefore, the solution is to convert it to a Series and then test the values.
Along similar lines, another (admittedly roundabout) solution is to pre-convert to an object dtype numpy array, after which pd.notnull works directly (np.isnan, as far as I can tell, still raises a TypeError on object arrays):
>>> pd.notnull(np.array(x, dtype=object))
array([ True, False])
I imagine pd.notnull implicitly converts the list to an array of strings, rendering NaN as the string “nan”, so it is no longer a “null” value.
Is pd.notnull doing that here? Or is there something else going on behind the scenes that I should be aware of?
Notes
In [156]: pd.__version__
Out[156]: '0.22.0'
Solution
This is the issue associated with this behavior: https://github.com/pandas-dev/pandas/issues/20675
In short, if the argument passed to notnull is of type list, it is internally converted to an np.array using np.asarray.
The bug occurred because, if no dtype is specified, numpy converts np.nan to a string (which pd.isnull does not recognize as a null value):
import numpy as np

a = ['A4', np.nan]
np.asarray(a)
# array(['A4', 'nan'], dtype='<U3')
This issue was fixed in version 0.23.0 by calling np.asarray with dtype=object.
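On versions before the fix, the same idea can be applied by hand. The following is a minimal sketch, not the actual pandas patch: the helper name notnull_elementwise is hypothetical, and it simply does the dtype=object conversion before calling pd.notnull.

```python
import numpy as np
import pandas as pd

def notnull_elementwise(values):
    """Hypothetical helper: null-check each item of a list or tuple.

    Converting with dtype=object keeps np.nan as a float instead of
    letting numpy turn it into the string 'nan'.
    """
    return pd.notnull(np.asarray(list(values), dtype=object))

x = ('A4', np.nan)

# Without an explicit dtype, numpy picks a string dtype and nan becomes 'nan'.
as_strings = np.asarray(list(x))
print(as_strings.dtype.kind)  # 'U' (unicode string)

result = notnull_elementwise(x)
print(result)
```

This should give one boolean per item, with False only for the nan entry, regardless of whether the list path in pd.notnull has been patched.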