Python – Custom scoring of GridSearchCV using folding-related parameters


Question

I’m working on a learning-to-rank problem where predictions are evaluated pointwise, but model performance is evaluated per group (per query).

More specifically, the estimator outputs a continuous value (much like a regressor):

> y = est.predict(X); y
array([71.42857143,  0.        , 71.42857143, ...,  0.        ,
       28.57142857,  0.        ])

However, the scoring function requires per-query aggregation, i.e. group-wise evaluation; the same groups parameter is also passed to GridSearchCV so that fold partitioning respects the query boundaries:

> ltr_score(y_true, y_pred, groups=g)
0.023

Obstacles

So far so good. Things go bad when supplying a custom scorer to GridSearchCV: I can’t dynamically change the groups parameter inside the scoring function according to the current CV fold:

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

ltr_scorer = make_scorer(ltr_score, groups=g)  # Here's the problem, g is fixed
param_grid = {...}

gcv = GridSearchCV(estimator=est, param_grid=param_grid, scoring=ltr_scorer)
gcv.fit(X, y, groups=g)  # groups is passed to fit, not to the constructor

What is the easiest way to solve this problem?

A (failed) method

In a similar question, a comment suggests:

Why can’t you just store {the grouping column} locally and use it when necessary, by indexing with the train/test indices provided by the splitter?

The OP replied “seems to work”. I thought it was doable too, but apparently it isn’t: GridSearchCV generates all cross-validation splits up front, before any fitting, predicting, or scoring takes place. This means that at scoring time I can’t (at least not obviously) recover the original indices that produced the current split’s sub-selection.

For completeness, my code:

import numpy as np
from sklearn.model_selection import GroupShuffleSplit


class QuerySplitScorer:
    def __init__(self, X, y, groups):
        self._X = np.array(X)
        self._y = np.array(y)
        self._groups = np.array(groups)
        self._splits = None
        self._current_split = None

    def __iter__(self):
        self._splits = iter(GroupShuffleSplit().split(self._X, self._y, self._groups))
        return self

    def __next__(self):
        self._current_split = next(self._splits)
        return self._current_split

    def get_scorer(self):
        def scorer(y_true, y_pred):
            _, test_idx = self._current_split
            return _score(  # _score is the underlying group-aware metric
                y_true=y_true,
                y_pred=y_pred,
                groups=self._groups[test_idx],
            )
        return scorer

Usage:

qss = QuerySplitScorer(X, y_true, g)
gcv = GridSearchCV(estimator=est, cv=qss, scoring=qss.get_scorer(), param_grid=param_grid, verbose=1)
gcv.fit(X, y_true)

This doesn’t work: by the time scoring happens, self._current_split is stuck at the last generated split, because GridSearchCV consumes the whole split iterator before running any fit/score step.
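This behavior can be reproduced in miniature. Internally, recent scikit-learn versions pair candidate parameters with CV splits via itertools.product, and product completely consumes its input iterables before yielding its first item — so every split is generated before any scoring runs. A minimal illustration (not sklearn code, just the same iterator pattern):

```python
from itertools import product

generated = []

def splits():
    # stand-in for a lazy CV split generator
    for k in range(3):
        generated.append(k)
        yield k

# product() materializes its iterable arguments up front, so all three
# "splits" are generated as soon as the pairing is created
pairs = product(['param_a', 'param_b'], splits())
print(generated)  # [0, 1, 2] — already exhausted before the first pair is used
```

This is why any scheme that tracks "the current split" as mutable iterator state ends up seeing only the last split.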

Solution

As I understand it, the values passed to scoring come in pairs (values, groups), but the estimator should not be fit with groups. So let’s cut the groups off inside an estimator wrapper, while leaving them available to the scorer.

A simple estimator wrapper (it may require some improvements to fully meet the requirements):

import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
# from sklearn.utils.estimator_checks import check_estimator


class CutEstimator(BaseEstimator):

    def __init__(self, base_estimator):
        self.base_estimator = base_estimator

    def fit(self, X, y):
        # the first column of y is the real target; the remaining
        # columns (the groups) are cut off before fitting
        self._base_estimator = clone(self.base_estimator)
        self._base_estimator.fit(X, y[:, 0].ravel())
        return self

    def predict(self, X):
        return self._base_estimator.predict(X)

# check_estimator(CutEstimator(LogisticRegression()))

Then we can use it:

def my_score(y, y_pred):
    # the second column of y carries the groups; here we just sum them
    return np.sum(y[:, 1])

param_grid = {'base_estimator__C': [0.2, 0.5]}

X = np.random.randn(30, 3)
y = np.random.randint(3, size=(X.shape[0], 1))
g = np.ones_like(y)

gs = GridSearchCV(CutEstimator(LogisticRegression()), param_grid, cv=3,
                  scoring=make_scorer(my_score), return_train_score=True
                  ).fit(X, np.hstack((y, g)))

print(gs.cv_results_['mean_test_score'])   # 10, since each test fold has 30/3 samples
print(gs.cv_results_['mean_train_score'])  # 20, since each train fold has 30 - 30/3 samples

Output:

 [ 10.  10.]
 [ 20.  20.]

Update 1: a hacky approach that leaves the estimator unchanged:

param_grid = {'C': [0.2, 0.5]}
X = np.random.randn(30, 3)
y = np.random.randint(3, size=X.shape[0])
g = np.random.randint(3, size=X.shape[0])
cv = GroupShuffleSplit(n_splits=3, random_state=100)

# pre-compute the groups of every train/test fold, keyed by a hash
# of the corresponding y values
groups_info = {}
for a, b in cv.split(X, y, g):
    groups_info[hash(y[b].tobytes())] = g[b]
    groups_info[hash(y[a].tobytes())] = g[a]

def my_score(y, y_pred):
    g = groups_info[hash(y.tobytes())]
    return np.sum(g)

gs = GridSearchCV(LogisticRegression(), param_grid, cv=cv,
                  scoring=make_scorer(my_score), return_train_score=True,
                  ).fit(X, y, groups=g)
print(gs.cv_results_['mean_test_score'])
print(gs.cv_results_['mean_train_score'])
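An alternative worth noting (my addition, not part of the original answer): scoring can also be a plain callable with signature (estimator, X, y). GridSearchCV calls it with the test slice of X, so if the group ids ride along as an extra column of X — with a wrapper that strips them before fitting, mirroring CutEstimator — the scorer can recover the current fold's groups directly. The wrapper and scorer names below are hypothetical, and the score itself is a placeholder.

```python
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupShuffleSplit

class DropGroupColumn(BaseEstimator):
    """Hypothetical wrapper: the last column of X carries the group id,
    which is stripped before delegating to the wrapped estimator."""
    def __init__(self, base_estimator=None):
        self.base_estimator = base_estimator

    def fit(self, X, y):
        self.model_ = clone(self.base_estimator).fit(X[:, :-1], y)
        return self

    def predict(self, X):
        return self.model_.predict(X[:, :-1])

def group_scorer(estimator, X, y):
    # X is the current *test* slice, so its group column is already
    # aligned with y and with the predictions
    g = X[:, -1]
    y_pred = estimator.predict(X)
    # placeholder group-wise score: number of distinct queries in the fold
    return float(len(np.unique(g)))

rng = np.random.RandomState(0)
X = rng.randn(30, 3)
y = rng.randint(2, size=30)
g = rng.randint(5, size=30)
Xg = np.hstack([X, g[:, None]])  # append groups as the last column

gs = GridSearchCV(
    DropGroupColumn(LogisticRegression()),
    {'base_estimator__C': [0.2, 0.5]},
    cv=GroupShuffleSplit(n_splits=3, random_state=100),
    scoring=group_scorer,
).fit(Xg, y, groups=g)
```

This avoids the hash lookup entirely, at the cost of smuggling a non-feature column through X.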
