Find the mean squared error of linear regression in Python (with scikit-learn)

I’m trying to do a simple linear regression in Python where the x variable is the word
count of a project description and the y value is the funding speed in days.

I’m a bit confused because the root mean square error (RMSE) is 13.77 for the test data
and 13.88 for the training data. First, shouldn’t the RMSE be between 0 and 1?
Second, shouldn’t the RMSE of the test data be higher than that of the training data?
So I think I’m doing something wrong, but I’m not sure where the error is.

Also, I need to know the weight coefficients of the regression, but unfortunately
I’m not sure how to print them because they are a bit hidden in the sklearn methods. Can someone help?

This is what I have at the moment:

import numpy as np
import matplotlib.pyplot as plt
import sqlite3
from sklearn.model_selection import train_test_split
from sklearn import linear_model

con = sqlite3.connect('database.db')
cur = con.cursor()

# y-variable in regression is funding speed ("DAYS_NEEDED")    
cur.execute("SELECT DAYS_NEEDED FROM success")
y = cur.fetchall()                  # list of tuples
y = np.array([i[0] for i in y])     # list of int   # y.shape = (1324476,)

# x-variable in regression is the project description length ("WORD_COUNT")
cur.execute("SELECT WORD_COUNT FROM success")
x = cur.fetchall()
x = np.array([i[0] for i in x])     # list of int   # x.shape = (1324476,)

# Get the train and test data split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Fit a model
lm = linear_model.LinearRegression()
x_train = x_train.reshape(-1, 1)    # new shape: (1059580, 1)
y_train = y_train.reshape(-1, 1)    # new shape: (1059580, 1)
model = lm.fit(x_train, y_train)    # fit() returns the fitted estimator itself
x_test = x_test.reshape(-1, 1)      # new shape: (264896, 1)
predictions_test = lm.predict(x_test)
predictions_train = lm.predict(x_train)

print("y_test[5]: ", y_test[5])     # 14
print("predictions[5]: ", predictions_test[5]) # [ 12.6254537]

# Calculate the root mean square error (RMSE) for test and training data
N = len(y_test)
rmse_test = np.sqrt(np.sum((np.array(y_test).flatten() - np.array(predictions_test).flatten())**2)/N)
print("RMSE TEST: ", rmse_test)     # 13.770731326

N = len(y_train)
rmse_train = np.sqrt(np.sum((np.array(y_train).flatten() - np.array(predictions_train).flatten())**2)/N)
print("RMSE train: ", rmse_train)   # 13.8817814595

Any help is greatly appreciated! Thanks!


  1. RMSE has the same units as the dependent variable, so it is not confined to the range 0 to 1. If the variable you’re trying to predict varies between 0 and 100, an RMSE of 99 is terrible, while an RMSE of 5 is very good. However, if the data range from 1 to 10 and the RMSE is 5, then you have a problem! I hope this illustrates the point.

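  2. Since the RMSE of your training and test data is similar, pat yourself on the back! You’ve actually done a good job. If the test RMSE were noticeably higher than the training RMSE, that would suggest some overfitting. A test RMSE slightly below the training RMSE, as here, can simply come from the random split.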
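To make the point about scale concrete, here is a minimal sketch that computes RMSE and puts it in context by comparing it with a naive baseline that always predicts the mean of y. The numbers are made up for illustration (they are not the data from the question), and the `rmse` helper is a hypothetical name, not part of scikit-learn.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as y_true."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical values, for illustration only
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

error = rmse(y_true, y_pred)

# Baseline: always predict the mean of y.
# Its RMSE equals the (population) standard deviation of y.
baseline = rmse(y_true, np.full_like(y_true, y_true.mean()))

print("RMSE:", error)
print("Baseline RMSE (predict the mean):", baseline)
print("RMSE relative to the range of y:", error / (y_true.max() - y_true.min()))
```

An RMSE well below the baseline suggests the model explains a useful part of the variation in y; an RMSE close to the baseline suggests the feature adds little, whatever the absolute number is.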

As Umang said in the comments, you can use model.coef_ and model.intercept_ to print the weights fitted by your model.
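As a small sketch of reading those attributes, the example below fits `LinearRegression` on synthetic data generated from a known line (y = 3x + 2, an assumption chosen so the result is easy to check) and prints the slope and intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 3*x + 2 exactly
x = np.arange(10, dtype=float).reshape(-1, 1)   # shape (10, 1): one feature
y = 3.0 * x.ravel() + 2.0                       # shape (10,)

model = LinearRegression().fit(x, y)

print("slope (coef_):", model.coef_)            # one entry per feature
print("intercept (intercept_):", model.intercept_)
```

Note that in the question's code y_train is reshaped to a column, so there coef_ would be a 2-D array of shape (1, 1) and intercept_ a 1-D array of shape (1,), rather than the 1-D array and scalar shown here.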
