Cross-validation in linear regression
I’m trying to do cross-validation for linear regression using the Python scikit-learn library, and I’m unsure of the appropriate way to perform it on a given dataset.
Two APIs that confuse me a bit are cross_val_score() and the regularized cross-validation estimators such as LassoCV().
As I understand it, cross_val_score is used to get scores based on cross-validation, and it can be combined with Lasso() to obtain regularized cross-validation scores (example: here).
In contrast, LassoCV() is, as its documentation suggests, meant for performing LASSO over a given range of tuning parameters (alpha, or lambda).
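For concreteness, here is a minimal sketch of the two approaches as I understand them (make_regression is just a stand-in for my real dataset):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import cross_val_score

# toy data standing in for the real dataset
X, y = make_regression(n_samples=200, n_features=20, noise=5, random_state=0)
alphas = np.logspace(-4, 1, 6)

# Approach 1: cross_val_score with a fixed alpha -- scores one model, no tuning
scores = cross_val_score(Lasso(alpha=0.1), X, y, cv=5)
print(scores.mean())

# Approach 2: LassoCV -- picks the best alpha from the grid via internal CV
reg = LassoCV(alphas=alphas, cv=5).fit(X, y)
print(reg.alpha_)
```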
Now, my questions are:
- Which method is better: cross_val_score with Lasso, or just LassoCV?
- What is the correct way to perform cross-validation for linear regression (or other algorithms such as logistic regression, NN, etc.)?
Thank you.
Solution
To confuse you even more: consider using GridSearchCV, which performs cross-validation and tunes the hyperparameters at the same time.
Demo:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso, Ridge, SGDRegressor
from sklearn.pipeline import Pipeline

# toy data so the demo runs end-to-end; substitute your own X, y
X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.33, random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('regr', Lasso())
])

# the 'regr' step itself is searched over, so one grid covers several estimators
param_grid = [
    {
        'regr': [Lasso(), Ridge()],
        'regr__alpha': np.logspace(-4, 1, 6),
    },
    {
        'regr': [SGDRegressor()],
        'regr__alpha': np.logspace(-5, 0, 6),
        'regr__max_iter': [500, 1000],
    },
]

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid.fit(X_train, y_train)

predicted = grid.predict(X_test)  # predict takes only X
print('Score:\t{}'.format(grid.score(X_test, y_test)))
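To connect this back to the question: LassoCV is essentially a specialized (and faster) version of running GridSearchCV over alpha with a Lasso estimator, since it exploits the lasso regularization path internally. A rough sketch on made-up toy data, showing the two side by side (their chosen alphas often, though not always, agree, because the selection criteria differ slightly):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import GridSearchCV, KFold

# toy data for illustration only
X, y = make_regression(n_samples=300, n_features=15, noise=5, random_state=0)
alphas = np.logspace(-4, 1, 6)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# generic grid search over alpha...
grid = GridSearchCV(Lasso(max_iter=10000), {'alpha': alphas}, cv=cv).fit(X, y)

# ...versus the specialized estimator with built-in CV over the same grid
lcv = LassoCV(alphas=alphas, cv=cv, max_iter=10000).fit(X, y)

print(grid.best_params_['alpha'], lcv.alpha_)
```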