Sum values in one column based on date ranges in two other columns
I want to sum all values in one column based on a date range defined by two other columns:
Start_Date Value_to_sum End_date
2017-12-13 2 2017-12-13
2017-12-13 3 2017-12-16
2017-12-14 4 2017-12-15
2017-12-15 2 2017-12-15
A simple groupby doesn't do this, because it only credits each value to a single date.
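For example, with the table above (a sketch, assuming the two date columns are already parsed as datetimes), grouping on Start_Date alone gives 4 for 2017-12-14 instead of the expected 7, because row 2's value never reaches the days after its start:

```python
import pandas as pd

df = pd.DataFrame({
    "Start_Date": pd.to_datetime(["2017-12-13", "2017-12-13", "2017-12-14", "2017-12-15"]),
    "Value_to_sum": [2, 3, 4, 2],
    "End_date": pd.to_datetime(["2017-12-13", "2017-12-16", "2017-12-15", "2017-12-15"]),
})

# Grouping on Start_Date alone credits each value only to its start day
print(df.groupby("Start_Date")["Value_to_sum"].sum())
```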
We could use a nested for loop, but it will run forever on real data:

unique_date = data.Start_Date.unique()
carry = pd.DataFrame({'Date': unique_date})
carry['total'] = 0
for n in tqdm(range(len(carry))):
    # Candidate rows are those whose range starts on or before this date
    tr = data.loc[data['Start_Date'] <= carry['Date'][n]]
    for i in tr.index:
        # ...and ends on or after it
        if carry['Date'][n] <= tr['End_date'][i]:
            carry['total'][n] += tr['Value_to_sum'][i]

Something like that works, but as I said it takes a long time.
The expected output is one total per unique date. It should be:
2017-12-13 = 5, 2017-12-14 = 7, 2017-12-15 = 9.
How do I calculate sum based on date range?
Solution
First, group by ["Start_Date", "End_date"] so that duplicate ranges are summed once, saving some operations.
from collections import Counter

c = Counter()

# Collapse duplicate (start, end) pairs before counting
df_g = df.groupby(["Start_Date", "End_date"]).sum().reset_index()

def my_counter(row):
    s, v, e = row.Start_Date, row.Value_to_sum, row.End_date
    if s == e:
        # Single-day range: add the value directly
        c[pd.Timestamp(s, freq="D")] += v
    else:
        # Multi-day range: add the value to every date it covers
        c.update({date: v for date in pd.date_range(s, e)})

df_g.apply(my_counter, axis=1)
print(c)
"""
Counter({Timestamp('2017-12-15 00:00:00', freq='D'): 9,
Timestamp('2017-12-14 00:00:00', freq='D'): 7,
Timestamp('2017-12-13 00:00:00', freq='D'): 5,
Timestamp('2017-12-16 00:00:00', freq='D'): 3})
"""
Tools used:
Counter.update([iterable-or-mapping]):
Elements are counted from an iterable or added-in from another mapping (or counter). Like dict.update() but adds counts instead of replacing them. Also, the iterable is expected to be a sequence of elements, not a sequence of (key, value) pairs. — Cited from Python 3 Documentation
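A minimal illustration of that add-instead-of-replace behavior:

```python
from collections import Counter

c = Counter({"a": 1})
c.update({"a": 2, "b": 5})  # counts are added, not overwritten
print(c)  # Counter({'b': 5, 'a': 3})
```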