Python Pandas: check whether a value appears multiple times on the same day
I have a Pandas data frame as shown below. What I want to do is check whether a station has the variable yyy and any other variables on the same day (as is the case with station1). If this is true, I need to delete the entire row containing yyy.
Currently I’m doing this with iterrows(): I iterate over the dates on which this variable appears and change the variable to something like “delete me”. From that I build a new data frame (because pandas doesn’t support replacing in place) and filter the new data frame to remove the unwanted rows. This works for now because my data frame is small, but it is unlikely to scale.
Question: This seems like a very “non-Pandas” approach. Is there a better way to remove the unwanted rows?
              dateuse   station variable1
0 2012-08-12 00:00:00  station1       xxx
1 2012-08-12 00:00:00  station1       yyy
2 2012-08-23 00:00:00  station2       aaa
3 2012-08-23 00:00:00  station3       bbb
4 2012-08-25 00:00:00  station4       ccc
5 2012-08-25 00:00:00  station4       ccc
6 2012-08-25 00:00:00  station4       ccc
Solution
I might use a Boolean array for indexing. We’re going to remove rows that contain yyy and belong to a dateuse/station combination with more than one row (if I understand what you mean, anyway!).
We can use transform to broadcast the size of each dateuse/station group back to the length of the data frame, and then flag the rows whose group contains more than one row. Then we can combine that mask (using &) with a mask of where yyy is located, and negate the result to keep everything else.
>>> multiple = df.groupby(["dateuse", "station"])["variable1"].transform(len) > 1
>>> must_be_isolated = df["variable1"] == "yyy"
>>> df[~(multiple & must_be_isolated)]
              dateuse   station variable1
0 2012-08-12 00:00:00  station1       xxx
2 2012-08-23 00:00:00  station2       aaa
3 2012-08-23 00:00:00  station3       bbb
4 2012-08-25 00:00:00  station4       ccc
5 2012-08-25 00:00:00  station4       ccc
6 2012-08-25 00:00:00  station4       ccc
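For anyone who wants to try this without the original data, here is a minimal self-contained sketch. It rebuilds the sample frame from the question and applies the same mask. The names is_yyy, mixed, result and result_strict are mine, not from the question. The "mixed" variant is an assumption about the intent: it only drops yyy when the group really contains some other value, whereas the size-based mask would also drop yyy rows that are merely repeated on the same day.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "dateuse": pd.to_datetime(
            ["2012-08-12", "2012-08-12", "2012-08-23", "2012-08-23",
             "2012-08-25", "2012-08-25", "2012-08-25"]
        ),
        "station": ["station1", "station1", "station2", "station3",
                    "station4", "station4", "station4"],
        "variable1": ["xxx", "yyy", "aaa", "bbb", "ccc", "ccc", "ccc"],
    }
)

groups = df.groupby(["dateuse", "station"])["variable1"]

# Size-based mask, as in the answer: True where the dateuse/station
# group has more than one row.
multiple = groups.transform("size") > 1

# Stricter variant (my assumption about the intent): True only where the
# group holds more than one distinct value.
mixed = groups.transform("nunique") > 1

# True on the rows that hold the value we may want to drop.
is_yyy = df["variable1"] == "yyy"

# Keep every row except the yyy rows that share a group with other rows.
result = df[~(multiple & is_yyy)]
result_strict = df[~(mixed & is_yyy)]

print(result)
```

On this sample both variants give the same frame. They would only differ if a station had several yyy rows and nothing else on a given day.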