Pandas count if the number of occurrences is greater than x in a rolling 7-day interval
I have a data with time_stamp
and user_id
.
The data is sorted by user_id
and time_stamp
.
I want to get the number of visits within a 7-day rolling interval
for each user_id and calculate the number of visits within that 7-day
rolling interval. If the number of visits is > 3, a new column is set to 1 for that user_id, otherwise 0.
My data
time_stamp,user_id,visited
2014-04-22 03:53:30,1,1
2013-11-15 03:45:04,2,1
2013-11-29 03:45:04,2,1
2013-12-09 03:45:04,2,1
2013-12-25 03:45:04,2,1
2013-12-31 03:45:04,2,1
2014-01-08 03:45:04,2,1
2014-02-03 03:45:04,2,1 <- in a 7 day interval (till 2014-02-10)
2014-02-08 03:45:04,2,1 <- there is 3 visits
2014-02-09 03:45:04,2,1 <-
Expected output
user_id outcome
1 0
2 1
What I tried :
I don’t know the best way. My first thought was to get the continuous time difference for each user_id
.
df['timediff'] = df.groupby(['user_id','time_stamp'])['time_stamp'].diff() / np.timedelta64(1, 'D')
But this output is given to all Nans.
2014-04-22 03:53:30 1 1 NaN
1 2013-11-15 03:45:04 2 1 NaN
2 2013-11-29 03:45:04 2 1 NaN
3 2013-12-09 03:45:04 2 1 NaN
4 2013-12-25 03:45:04 2 1 NaN
How can I change the algorithm?
Solution
I would use df.rolling
If each user_id has a series, the offset is 7 days. Then you can check anywhere in the series, the rolling sum is greater than or equal to 3. To get the series for each user, use set_index )
and >unstack :
(df.set_index(['time_stamp', 'user_id'])
.unstack()
.rolling('7D')
.sum()
.ge(3)
.any())
# user_id
#visited 1 False
# 2 True