Custom scoring in GridSearchCV using fold-dependent parameters
Question
I’m working on a learning-to-rank problem where the standard is pointwise prediction but group-wise (per-query) evaluation of model performance.
More specifically, the estimator outputs a continuous value (much like a regressor):
>>> y = est.predict(X); y
array([71.42857143,  0.        , 71.42857143, ...,  0.        ,
       28.57142857,  0.        ])
However, the scoring function requires aggregation by query, i.e. per-group evaluation, much like the groups parameter that is passed to GridSearchCV so that fold partitioning respects group boundaries:
>>> ltr_score(y_true, y_pred, groups=g)
0.023
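(The body of ltr_score is not shown in the question. Purely for illustration, a group-aggregated metric of this shape could average a pairwise-ordering score within each query; this hypothetical sketch computes the fraction of correctly ordered pairs per group:)

```python
import numpy as np

def ltr_score(y_true, y_pred, groups=None):
    # Hypothetical group-aggregated metric: for each query (group), count
    # the fraction of item pairs whose predicted order matches the true
    # order, then average across queries.
    scores = []
    for q in np.unique(groups):
        mask = groups == q
        t, p = y_true[mask], y_pred[mask]
        pairs = [(i, j) for i in range(len(t)) for j in range(i + 1, len(t))]
        if not pairs:
            continue  # single-item queries carry no ordering information
        concordant = sum((t[i] - t[j]) * (p[i] - p[j]) > 0 for i, j in pairs)
        scores.append(concordant / len(pairs))
    return float(np.mean(scores)) if scores else 0.0
```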
Obstacles
So far so good. Things break when passing a custom scorer to GridSearchCV: I can’t dynamically change the groups parameter inside the scoring function to match the current CV fold:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

ltr_scorer = make_scorer(ltr_score, groups=g)  # Here's the problem: g is fixed
param_grid = {...}
gcv = GridSearchCV(estimator=est, param_grid=param_grid, scoring=ltr_scorer)
gcv.fit(X, y, groups=g)
What is the easiest way to solve this problem?
A (failed) attempt
In a similar question, a comment suggests:

Why can’t you just store {the grouping column} locally and use it when needed, indexing with the train/test indices provided by the splitter?

The OP replied that it “seems to work”. I also thought it would be doable, but it isn’t. Apparently, GridSearchCV generates all cross-validation splits up front, before any splitting, fitting, predicting, and scoring takes place. That means I can’t (ostensibly) recover, at scoring time, which original indices produced the current split’s subselection.
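This matches how scikit-learn builds its work list: internally, the candidate parameter settings are crossed with the CV splits (roughly via itertools.product), and product consumes the split iterator eagerly. A minimal pure-Python illustration (a sketch of the mechanism, not scikit-learn’s actual code) of why an iterator-based splitter ends up stuck on its last split:

```python
from itertools import product

drawn = []

def splits():
    # stand-in for cv.split(...): records when each split is generated
    for i in range(3):
        drawn.append(i)
        yield i

# Crossing candidates with splits exhausts the generator immediately,
# before any fit/score call would run:
work = list(product(['params_a', 'params_b'], splits()))
print(drawn)      # all three splits were drawn up front
print(len(work))  # 6 (candidate, split) jobs
```

By the time any scoring runs, the generator’s “current split” state already points at the final split, which is exactly the failure observed below.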
For completeness, my code:
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

class QuerySplitScorer:
    def __init__(self, X, y, groups):
        self._X = np.array(X)
        self._y = np.array(y)
        self._groups = np.array(groups)
        self._splits = None
        self._current_split = None

    def __iter__(self):
        self._splits = iter(GroupShuffleSplit().split(self._X, self._y, self._groups))
        return self

    def __next__(self):
        self._current_split = next(self._splits)
        return self._current_split

    def get_scorer(self):
        def scorer(y_true, y_pred):
            # score only the test rows of the split we believe is current
            _, test_idx = self._current_split
            return ltr_score(
                y_true=y_true,
                y_pred=y_pred,
                groups=self._groups[test_idx]
            )
        return scorer
Usage:
qss = QuerySplitScorer(X, y_true, g)
gcv = GridSearchCV(estimator=est, cv=qss, scoring=qss.get_scorer(), param_grid=param_grid, verbose=1)
gcv.fit(X, y_true)
This doesn’t work: self._current_split stays fixed at the last generated split, because all splits are drawn before any scoring happens.
Solution
As I understand it, the values sent to the scorer come in pairs (values, groups), but the estimator should not see the groups. Let’s strip them off in an estimator wrapper, while keeping them available to the scorer.

A simple estimator wrapper (it may need some polish to fully comply with the estimator API):
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
# from sklearn.utils.estimator_checks import check_estimator

class CutEstimator(BaseEstimator):

    def __init__(self, base_estimator):
        self.base_estimator = base_estimator

    def fit(self, X, y):
        # y carries the groups as an extra column; train on the target column only
        self._base_estimator = clone(self.base_estimator)
        self._base_estimator.fit(X, y[:, 0].ravel())
        return self

    def predict(self, X):
        return self._base_estimator.predict(X)

# check_estimator(CutEstimator(LogisticRegression()))
Then we can use it:

def my_score(y, y_pred):
    # the score is computed from the group column hidden inside y
    return np.sum(y[:, 1])

param_grid = {'base_estimator__C': [0.2, 0.5]}

X = np.random.randn(30, 3)
y = np.random.randint(3, size=(X.shape[0], 1))
g = np.ones_like(y)

gs = GridSearchCV(CutEstimator(LogisticRegression()), param_grid, cv=3,
                  scoring=make_scorer(my_score), return_train_score=True
                  ).fit(X, np.hstack((y, g)))

print(gs.cv_results_['mean_test_score'])   # 10, since each test fold has 30/3 rows
print(gs.cv_results_['mean_train_score'])  # 20, since each train fold has 30 - 30/3 rows
Output:
[ 10. 10.]
[ 20. 20.]
Update 1: a hacky way that leaves the estimator unchanged:
param_grid = {'C': [0.2, 0.5]}

X = np.random.randn(30, 3)
y = np.random.randint(3, size=X.shape[0])
g = np.random.randint(3, size=X.shape[0])

cv = GroupShuffleSplit(3, random_state=100)

# Precompute the group labels of every train/test fold, keyed by a hash of its y values
groups_info = {}
for a, b in cv.split(X, y, g):
    groups_info[hash(y[b].tobytes())] = g[b]
    groups_info[hash(y[a].tobytes())] = g[a]

def my_score(y, y_pred):
    # Recover the groups of the current fold by hashing the y it was given
    g = groups_info[hash(y.tobytes())]
    return np.sum(g)

gs = GridSearchCV(LogisticRegression(), param_grid, cv=cv,
                  scoring=make_scorer(my_score), return_train_score=True,
                  ).fit(X, y, groups=g)
print(gs.cv_results_['mean_test_score'])
print(gs.cv_results_['mean_train_score'])
print (gs.cv_results_['mean_test_score'])
print (gs.cv_results_['mean_train_score'])