Python – sklearn : apply same scaling to train and predict in a pipeline

I’m writing a function that selects the best model by k-fold cross-validation. Inside the function, I have a pipeline that:

  1. Scales the data
  2. Finds the best parameters for a decision tree regressor

Then I want to use the model to predict some target values. To do this, I have to apply the same scaling factor that was applied during the grid search.

Does the pipeline reuse the scaling it fitted on the training data when it transforms the data I want to predict on, even if I don’t specify it? From the documentation it seems that it does, but I’m not sure, because this is my first time using a pipeline.

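import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

def build_model(data, target, param_grid):
    # compute the range of each feature (data is expected to be a pandas DataFrame)
    features = data.keys()
    feature_range = dict()
    maxs = data.max(axis=0)
    mins = data.min(axis=0)
    for feature in features:
        # compare strings with !=, not with the identity operator "is not"
        if feature != 'metric':
            feature_range[feature] = {'max': maxs[feature], 'min': mins[feature]}

    # initialise the k-fold cross validator
    no_split = 10
    kf = KFold(n_splits=no_split, shuffle=True, random_state=42)
    # create the pipeline: scale first, then grid-search the tree parameters
    pipe = make_pipeline(MinMaxScaler(),
                         GridSearchCV(
                             estimator=DecisionTreeRegressor(),
                             param_grid=param_grid,
                             n_jobs=-1,
                             cv=kf,
                             refit=True))
    pipe.fit(data, target)

    return pipe, feature_range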

max_depth = np.arange(1,10)
min_samples_split = np.arange(2,10)
min_samples_leaf = np.arange(2,10) 
param_grid = {'max_depth': max_depth, 
              'min_samples_split': min_samples_split, 
              'min_samples_leaf': min_samples_leaf}
pipe, feature_range = build_model(data=data, target=target, param_grid=param_grid)

# could that be correct?
pipe.fit(test_data)

EDIT: I found in the [preprocessing] documentation that each preprocessing tool has an API to

compute the [transformation] on a training set so as to be able to reapply the same transformation on the testing set

If so, the fitted transformation is presumably stored internally, so the answer is probably yes.
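A minimal sketch of that fit-then-transform API, assuming MinMaxScaler and made-up toy arrays (the values below are illustrative, not taken from the data in the question):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy values chosen only for illustration.
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [20.0]])

scaler = MinMaxScaler()
scaler.fit(train)  # learns the minimum and maximum from the training data only

# transform() reuses the stored training min/max; it does not refit on test.
scaled_test = scaler.transform(test)

print(scaler.data_min_, scaler.data_max_)
print(scaled_test)
```

Because the stored minimum and maximum come from the training set, a test value outside the training range (20.0 here) maps outside [0, 1] instead of being rescaled to fit.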

Solution

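When a sklearn pipeline is fitted, it calls fit_transform (or fit followed by transform, if a step has no fit_transform method) on every step except the last. Therefore, in your pipeline, the scaling step transforms the data before it reaches GridSearchCV. When you later call predict, the pipeline applies the already-fitted scaler's transform to the new data and then predicts with the final step, so the same scaling is reused automatically. To get predictions, call pipe.predict(test_data); pipe.fit(test_data) would try to refit the whole pipeline on the test data.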
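One way to check this behaviour is to compare the pipeline's predictions with a manual "transform, then predict" on its fitted steps. The sketch below assumes synthetic data and a small, arbitrary parameter grid; the step names "minmaxscaler" and "gridsearchcv" are the lowercased class names that make_pipeline assigns.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, purely for illustration.
rng = np.random.RandomState(0)
X = rng.uniform(0, 100, size=(200, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=1.0, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(
    MinMaxScaler(),
    GridSearchCV(
        estimator=DecisionTreeRegressor(random_state=0),
        param_grid={"max_depth": [2, 4, 6]},  # arbitrary small grid
        cv=KFold(n_splits=5, shuffle=True, random_state=42),
        refit=True,
    ),
)
pipe.fit(X_train, y_train)

scaler = pipe.named_steps["minmaxscaler"]
search = pipe.named_steps["gridsearchcv"]

# The scaler's stored statistics come from the training data only.
same_min = np.allclose(scaler.data_min_, X_train.min(axis=0))

# pipe.predict() should match: transform with the fitted scaler,
# then predict with the refit best estimator.
manual_pred = search.predict(scaler.transform(X_test))
pipeline_pred = pipe.predict(X_test)
same_pred = np.allclose(manual_pred, pipeline_pred)

print(same_min, same_pred)
```

Note that with this layout the scaler is fitted once on the full training set before the cross-validation splits are made. If the scaling should instead be refitted inside each fold, a possible alternative is to pass a scaler-plus-regressor pipeline as the estimator of GridSearchCV and prefix the grid keys with the step name (for example decisiontreeregressor__max_depth).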

Documentation here
