Python – Subtract the values of columns from two different data frames in PySpark to find RMSE

I can’t figure it out. I’m trying to calculate RMSE between test and prediction data.

Test

col1    col2
 a        2 
 b        3

Prediction

col1   col2
 a       4 
 b       5

I want to compute test(col2) - prediction(col2) row by row, matching rows on col1. That is:

2 - 4 = -2
3 - 5 = -2

I tried

test.select("col2").subtract(prediction.select("col2"))

But that did not give the result I wanted. I need these differences in order to compute the RMSE. Is there a built-in function in Spark for computing RMSE?

Thank you.

Solution

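DataFrame.subtract is a set difference (it returns the rows of one DataFrame that do not appear in the other), not an arithmetic operation. What you need is a join followed by an arithmetic subtraction: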

test.join(prediction, on="col1").withColumn("sub", test.col2-prediction.col2)
