R and linear regression in Python – different results for the same problem
I’m training my data skills with python that I learned in R. Although, I have questions about simple linear regression
Climate change data:
[link here]
Python scripts
import os
import pandas as pd
import statsmodels.api as sm
train = df[df. Year>=2006]
X = train[['MEI', 'CO2', 'CH4', 'N2O', 'CFC.11', 'CFC.12', 'TSI', 'Aerosols']]
y = train[['Temp']]
model = sm. OLS(y, X).fit()
predictions = model.predict(X)
model.summary()
Python results
Dep. Variable: Temp R-squared: 0.972
Model: OLS Adj. R-squared: 0.964
Method: Least Squares F-statistic: 123.1
Date: Mon, 01 Oct 2018 Prob (F-statistic):9.54e-20
Time: 14:52:53 Log-Likelihood: 46.898
No. Observations: 36 AIC: -77.80
Df Residuals: 28 BIC: -65.13
Df Model: 8
Covariance Type: nonrobust
MEI
0.0361CO2
0.0046CH4
-0.0023N2O
-0.0141CFC-11
-0.0312CFC-12
0.0358TSI
-0.0033Aerosols
69.9680Omnibus:8.397
Durbin-Watson: 1.484Prob(Omnibus):0.015
Jarque-Bera (JB):10.511Skew: -0.546
Prob(JB): 0.00522Kurtosis: 5.412
Cond. No. 6.35e+06
R scripts
train <- climate_change[climate_change$Year>=2006,]
prev <- lm(Temp ~ ., data = train[,3:NCOL(train)])
summary(prev)
R result
Residuals:
Min 1Q Median 3Q Max
-0.221684 -0.032846 0.002042 0.037158 0.167887Coefficients:
MEI 0.036056
CO2 0.004817
CH4 -0.002366
N2O -0.013007
CFC-11 -0.033194
CFC-12 0.037775
TSI 0.009100
Aerosols 70.463329
Residual standard error: 0.07594 on 27 degrees of freedom Multiple
R-squared: 0.5346, Adjusted R-squared: 0.3967 F-statistic: 3.877 on
8 and 27 DF, p-value: 0.003721
Question
The R-squared of the two is very different, and the coefficients of the independent variables are also a little different. Can someone explain why?
Solution
Just to point this out: the least squares fit of statsmodel
does not include constants by default. If we remove constants from the fit of R, we get very similar results to the Python implementation, or conversely, if we add a constant to statsmodel-fit
, we get the same as R
:
Remove constants in R's
lm
call:
summary(lm(Temp ~ . - 1, data = train[,3:NCOL(train)]))
Call:
lm(formula = Temp ~ . - 1, data = train[, 3:NCOL(train)])
Residuals:
Min 1Q Median 3Q Max
-0.221940 -0.032347 0.002071 0.037048 0.167294
Coefficients:
Estimate Std. Error t value Pr(>|t|)
MEI 0.036076 0.027983 1.289 0.2079
CO2 0.004640 0.008945 0.519 0.6080
CH4 -0.002328 0.002132 -1.092 0.2843
N2O -0.014115 0.079452 -0.178 0.8603
`CFC-11` -0.031232 0.096693 -0.323 0.7491
`CFC-12` 0.035760 0.103574 0.345 0.7325
TSI -0.003283 0.036861 -0.089 0.9297
Aerosols 69.968040 33.093275 2.114 0.0435 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.07457 on 28 degrees of freedom
Multiple R-squared: 0.9724, Adjusted R-squared: 0.9645
F-statistic: 123.1 on 8 and 28 DF, p-value: < 2.2e-16
Let’s add a constant to the statsmodel
call:
X_with_constant = sm.add_constant(X)
model = sm. OLS(y, X_with_constant).fit()
model.summary()
Give us the same result:
OLS Regression Results
Dep. Variable: Temp R-squared: 0.535
Model: OLS Adj. R-squared: 0.397
Method: Least Squares F-statistic: 3.877
Date: Tue, 02 Oct 2018 Prob (F-statistic): 0.00372
Time: 10:14:03 Log-Likelihood: 46.899
No. Observations: 36 AIC: -75.80
Df Residuals: 27 BIC: -61.55
Df Model: 8
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -17.8663 563.008 -0.032 0.975 -1173.064 1137.332
MEI 0.0361 0.029 1.265 0.217 -0.022 0.095
CO2 0.0048 0.011 0.451 0.656 -0.017 0.027
CH4 -0.0024 0.002 -0.950 0.351 -0.007 0.003
N2O -0.0130 0.088 -0.148 0.884 -0.194 0.168
CFC-11 -0.0332 0.116 -0.285 0.777 -0.272 0.205
CFC-12 0.0378 0.123 0.307 0.761 -0.215 0.290
TSI 0.0091 0.392 0.023 0.982 -0.795 0.813
Aerosols 70.4633 37.139 1.897 0.069 -5.739 146.666
Omnibus: 8.316 Durbin-Watson: 1.488
Prob(Omnibus): 0.016 Jarque-Bera (JB): 10.432
Skew: -0.535 Prob(JB): 0.00543
Kurtosis: 5.410 Cond. No. 1.06e+08