How to count consecutive occurrences in a time period? – Pandas python
I have a data frame like this:
datetime levels shear
2016-05-01 00:10:00 100 5
2016-05-01 00:10:00 200 6
2016-05-01 00:10:00 300 10
2016-05-01 00:10:00 400 11
2016-05-01 00:20:00 100 7
2016-05-01 00:20:00 300 10
2016-05-01 00:30:00 100 5
2016-05-01 00:30:00 200 9
2016-05-01 00:30:00 300 12
2016-05-01 00:30:00 400 15
2016-05-01 00:40:00 100 19
2016-05-01 00:40:00 200 2
2016-05-01 00:40:00 300 18
This is wind shear at different heights at 10-minute intervals.
What I want to do is group the rows by level and then count how many shear observations occur consecutively (one 10-minute step after another) within each level. So, first I create another column holding the time difference between successive rows of each level:
data3_5['Delta'] = data3_5.sort_values(['levels','datetimes']).groupby('levels')['datetimes'].diff()
Then I define a function that counts how many times the same time difference occurs consecutively:
def rolling_count(val):
    if val == rolling_count.previous:
        rolling_count.count += 1
    else:
        rolling_count.previous = val
        rolling_count.count = 1
    return rolling_count.count
rolling_count.count = 0 #static variable
rolling_count.previous = None #static variable
Apply the function and add a new column:
data3_5['count'] = data3_5.sort_values(['levels','datetimes']).groupby('levels')['datetimes'].diff().apply(rolling_count)
Then finally print out the result:
group = data3_5.groupby('levels', as_index=False)
group_keys = sorted(list(group.groups.keys()))
for each in range(0,len(group_keys)):
    group_result = group.get_group(group_keys[each])
    print(group_result)
Some of the results are as follows:
datetime levels shear Delta count
2016-05-01 00:10:00 100 5 NaT 1
2016-05-01 00:20:00 100 7 00:10:00 1
2016-05-01 00:30:00 100 5 00:10:00 2
2016-05-01 00:40:00 100 19 00:10:00 3
datetime levels shear Delta count
2016-05-01 00:10:00 200 6 NaT 1
2016-05-01 00:30:00 200 9 00:20:00 1
2016-05-01 00:40:00 200 2 00:10:00 1
This is not the result I want, because of two problems:
1. The first Delta of each level is NaT, so the next row, which is actually a consecutive occurrence, is counted as a first occurrence.
2. Similarly, for level 200, the third row should have a count of 2.
How do I fix this? Thank you.
Solution
Let’s try:
First, let’s make sure df['datetime'] is indeed a datetime column:
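import pandas as pd

df['datetime'] = pd.to_datetime(df['datetime'])
# Time elapsed since the previous observation at the same level (NaT for the first row)
df['Delta'] = df.groupby('levels')['datetime'].transform(lambda x: x - x.shift(1))
# Running total, per level, of rows that follow their predecessor within 10 minutes
df['count'] = df.groupby('levels')['Delta'].transform(lambda x: x.le(pd.Timedelta('10 minutes')).cumsum())
df.sort_values(by='levels')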
Output:
datetime levels shear Delta count
0 2016-05-01 00:10:00 100 5 NaT 0
4 2016-05-01 00:20:00 100 7 00:10:00 1
6 2016-05-01 00:30:00 100 5 00:10:00 2
10 2016-05-01 00:40:00 100 19 00:10:00 3
1 2016-05-01 00:10:00 200 6 NaT 0
7 2016-05-01 00:30:00 200 9 00:20:00 0
11 2016-05-01 00:40:00 200 2 00:10:00 1
2 2016-05-01 00:10:00 300 10 NaT 0
5 2016-05-01 00:20:00 300 10 00:10:00 1
8 2016-05-01 00:30:00 300 12 00:10:00 2
12 2016-05-01 00:40:00 300 18 00:10:00 3
3 2016-05-01 00:10:00 400 11 NaT 0
9 2016-05-01 00:30:00 400 15 00:20:00 0