Sort a pandas DataFrame on two similar columns, where one column is NaN whenever the other has a value
I have a merged df which has 2 experiment IDs: experiment_a and experiment_b.
They follow the general nomenclature EXPT_YEAR_NUM, although some entries deviate from it (for example, some have no year). In this df, whenever experiment_a has a value, experiment_b is NaN, and vice versa.
For example:
experiment_a  experiment_b
EXPT_2011_06  NaN
NaN           EXPT_2011_07
How can I sort so that the values of experiment_a and experiment_b are ordered together in a single ascending sequence? What I get now is the rows ascending by experiment_a (where experiment_b is all NaN), followed by the rows ascending by experiment_b (where experiment_a is NaN).
This is what happens when I use sort_values:
df = df.sort_values(['experiment_a', 'experiment_b'])
It obviously just sorts _a first and then _b.
Solution
I believe you need fillna (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.fillna.html) to fill the NaN values of experiment_a from experiment_b, which gives a single Series. Then get the positions that would sort that Series with argsort, and finally select the rows in that order with iloc. The output is the sorted DataFrame:
print(df)
   experiment_a  experiment_b
0  EXPT_2011_06           NaN
1  EXPT_2010_06           NaN
2           NaN  EXPT_2011_07
df = df.iloc[df['experiment_a'].fillna(df['experiment_b']).argsort()]
print(df)
   experiment_a  experiment_b
1  EXPT_2010_06           NaN
0  EXPT_2011_06           NaN
2           NaN  EXPT_2011_07
Details:
print(df['experiment_a'].fillna(df['experiment_b']))
0    EXPT_2011_06
1    EXPT_2010_06
2    EXPT_2011_07
Name: experiment_a, dtype: object
print(df['experiment_a'].fillna(df['experiment_b']).argsort())
0    1
1    0
2    2
Name: experiment_a, dtype: int64
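The steps above can be put together as a self-contained sketch. The names key, order and sorted_df are only illustrative, and the frame is rebuilt by hand from the three rows shown in the example, so treat this as one possible way to run it rather than the only one:

```python
import numpy as np
import pandas as pd

# Rebuild the three-row example frame by hand
df = pd.DataFrame({
    "experiment_a": ["EXPT_2011_06", "EXPT_2010_06", np.nan],
    "experiment_b": [np.nan, np.nan, "EXPT_2011_07"],
})

# Step 1: fill the gaps in experiment_a from experiment_b to get one sort key
key = df["experiment_a"].fillna(df["experiment_b"])

# Step 2: positions that would sort the key in ascending order
order = key.argsort().to_numpy()

# Step 3: select the rows in that order (positional indexing)
sorted_df = df.iloc[order]
print(sorted_df)
```

The original index labels are kept, so the reordering stays visible. Chain .reset_index(drop=True) if a fresh 0..n-1 index is wanted.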
I've tested more solutions, and performance is a bit better with np.where, but it mostly depends on the data:
print(df)
   experiment_a  experiment_b
0  EXPT_2011_03           NaN
1           NaN  EXPT_2009_08
2           NaN  EXPT_2010_06
3  EXPT_2010_07           NaN
4           NaN  EXPT_2011_07
df = pd.concat([df] * 100000, ignore_index=True)
#[500000 rows x 2 columns]
In [41]: %timeit (df.iloc[(np.where(df['experiment_a'].isnull(), df['experiment_b'], df['experiment_a'])).argsort()])
1 loop, best of 3: 318 ms per loop
In [42]: %timeit (df.iloc[df['experiment_a'].fillna(df['experiment_b']).argsort()])
1 loop, best of 3: 335 ms per loop
In [43]: %timeit (df.iloc[df['experiment_a'].combine_first(df['experiment_b']).argsort()])
1 loop, best of 3: 333 ms per loop
In [44]: %timeit (df.iloc[df.experiment_a.where(df.experiment_a.notnull(), df.experiment_b).argsort()])
1 loop, best of 3: 342 ms per loop
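As a sanity check, here is a small sketch showing that the four timed variants build the same sort key and therefore produce the same row order on the five-row frame, before it is repeated with concat. The names order_where, order_fillna, order_combine and order_mask are made up for this illustration. It assumes the keys are unique: the default argsort is not guaranteed to be stable, so with repeated keys the order among ties may differ between variants.

```python
import numpy as np
import pandas as pd

# The five-row frame from the timing setup, before pd.concat
df = pd.DataFrame({
    "experiment_a": ["EXPT_2011_03", np.nan, np.nan, "EXPT_2010_07", np.nan],
    "experiment_b": [np.nan, "EXPT_2009_08", "EXPT_2010_06", np.nan, "EXPT_2011_07"],
})

# Variant 1: np.where returns a plain NumPy array
order_where = np.where(
    df["experiment_a"].isnull(), df["experiment_b"], df["experiment_a"]
).argsort()

# Variant 2: fillna
order_fillna = df["experiment_a"].fillna(df["experiment_b"]).argsort().to_numpy()

# Variant 3: combine_first
order_combine = df["experiment_a"].combine_first(df["experiment_b"]).argsort().to_numpy()

# Variant 4: Series.where keeps experiment_a where it is not null
order_mask = (
    df["experiment_a"]
    .where(df["experiment_a"].notnull(), df["experiment_b"])
    .argsort()
    .to_numpy()
)

print(df.iloc[order_where])
```

Since all four give the same ordering here, the choice between them comes down to readability and the small timing differences shown above.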