Python – Pandas count if the number of occurrences is greater than x in a rolling 7-day interval

Pandas count if the number of occurrences is greater than x in a rolling 7-day interval… here is a solution to the problem.

Pandas count if the number of occurrences is greater than x in a rolling 7-day interval

I have a data with time_stamp and user_id.

The data is sorted by user_id and time_stamp.

I want to get the number of visits within a 7-day rolling interval

for each user_id and calculate the number of visits within that 7-day rolling interval. If the number of visits is > 3, a new column is set to 1 for that user_id, otherwise 0.

My data

time_stamp,user_id,visited
2014-04-22 03:53:30,1,1
2013-11-15 03:45:04,2,1
2013-11-29 03:45:04,2,1
2013-12-09 03:45:04,2,1
2013-12-25 03:45:04,2,1
2013-12-31 03:45:04,2,1
2014-01-08 03:45:04,2,1
2014-02-03 03:45:04,2,1 <- in a 7 day interval (till 2014-02-10)
2014-02-08 03:45:04,2,1 <- there is 3 visits
2014-02-09 03:45:04,2,1 <-

Expected output

user_id outcome
1       0
2       1

What I tried :

I don’t know the best way. My first thought was to get the continuous time difference for each user_id.

df['timediff'] = df.groupby(['user_id','time_stamp'])['time_stamp'].diff() / np.timedelta64(1, 'D')

But this output is given to all Nans.

    2014-04-22 03:53:30 1   1   NaN
1   2013-11-15 03:45:04 2   1   NaN
2   2013-11-29 03:45:04 2   1   NaN
3   2013-12-09 03:45:04 2   1   NaN
4   2013-12-25 03:45:04 2   1   NaN

How can I change the algorithm?

Solution

I would use df.rolling If each user_id has a series, the offset is 7 days. Then you can check anywhere in the series, the rolling sum is greater than or equal to 3. To get the series for each user, use set_index ) and >unstack :

(df.set_index(['time_stamp', 'user_id'])
.unstack()
.rolling('7D')
.sum()
.ge(3)
.any())

#         user_id
#visited  1          False
#         2           True

Related Problems and Solutions