Python – Monthly aggregation in pyspark

Monthly aggregation in pyspark… here is a solution to the problem.

Monthly aggregation in pyspark

I’m looking for a way to aggregate data by month. I first want to keep only one month on my visit dates. My data frame looks like this :

Row(visitdate = 1/1/2013, 
patientid = P1_Pt1959, 
amount = 200, 
note = jnut, 
) 

My subsequent goal is to group by date of visit and calculate the sum. I’ve tried this :

from pyspark.sql import SparkSession

spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()

file_path = "G:/Visit Data.csv"
patients = spark.read.csv(file_path,header = True)
patients.createOrReplaceTempView("visitdate")

sqlDF = spark.sql("SELECT visitdate,SUM(amount) as totalamount from visitdate GROUP BY visitdate")
sqlDF.show()

Here is the result:

visitdate|totalamount|
+----------+-----------+
|  9/1/2013|    10800.0|
|25/04/2013|    12440.0|
|27/03/2014|    16930.0|
|26/03/2015|    18560.0|
|14/05/2013|    13770.0|
|30/06/2013|    13880.0

My goal is to get something like this:

  visitdate|totalamount|
+----------+-----------+
|1/1/2013|    10800.0|
|1/2/2013|    12440.0|
|1/3/2013|    16930.0|
|1/4/2014|    18560.0|
|1/5/2015|    13770.0|
|1/6/2015|    13880.0|

Solution

You can format visitdate is grouped first:

from pyspark.sql import functions as F

(df.withColumn('visitdate_month', F.date_format(F.col('visitdate'), '1/M/yyyy'))
.groupBy('visitdate_month')
.agg(F.sum(F.col('visitdate_month')))
)

Related Problems and Solutions