Calibration & Class Imbalance

09/19/2022

Robert Utterback (based on slides by Andreas Muller)
\( \renewcommand{\vec}[1]{\boldsymbol{#1}} \newcommand{\E}{\mathop{\boldsymbol{E}}} \newcommand{\var}{\boldsymbol{Var}} \newcommand{\norm}[1]{\lvert\lvert#1\rvert\rvert} \newcommand{\abs}[1]{\lvert#1\rvert} \newcommand{\ltwo}[1]{\norm{#1}_2} \newcommand{\lone}[1]{\norm{#1}_1} \newcommand{\sgn}[1]{\text{sign}\left( #1 \right)} \newcommand{\e}{\mathrm{e}} \newcommand{\minw}{\min_{w \in \mathbb{R}^p}} \newcommand{\sumn}{\sum_{i=1}^n} \newcommand{\logloss}{\log{(\exp{(-y_iw^T\vec{x}_i)} + 1)}} \)

Calibration

  • Probabilities can be much more informative than labels:
  • "The model predicted you don’t have cancer" vs "The model predicted you’re 40% likely to have cancer"

Calibration curve (Reliability diagram)

prob_table.png

calib_curve.png

calibration_curve with sklearn

Using a subsample of the covertype dataset

import numpy as np
from sklearn.linear_model import LogisticRegressionCV

print(X_train.shape)
print(np.bincount(y_train))
# (52292, 54)
# [19036 33256]

lr = LogisticRegressionCV().fit(X_train, y_train)
lr.C_
# array([ 2.783])
print(lr.predict_proba(X_test)[:10])
print(y_test[:10])
# [[ 0.681  0.319]
#  [ 0.049  0.951]
#  [ 0.706  0.294]
#  [ 0.537  0.463]
#  [ 0.819  0.181]
#  [ 0.     1.   ]
#  [ 0.794  0.206]
#  [ 0.676  0.324]
#  [ 0.727  0.273]
#  [ 0.597  0.403]]
# [0 1 0 1 1 1 0 0 0 1]
from sklearn.calibration import calibration_curve
probs = lr.predict_proba(X_test)[:, 1]
prob_true, prob_pred = calibration_curve(y_test, probs, n_bins=5)
print(prob_true)
print(prob_pred)
# [ 0.2    0.303  0.458  0.709  0.934]
# [ 0.138  0.306  0.498  0.701  0.926]

predprob_positive.png

Influence of number of bins

influence_bins.png

Comparing Models

calib_curve_models.png

Brier Score (for binary classification)

  • "mean squared error of probability estimate"

\[ BS = \frac{\sum_{i=1}^{n} (\widehat{p} (y_i)-y_i)^{2}}{n}\]
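sklearn computes this as brier_score_loss; a minimal sketch, reusing probs and y_test from the covertype example above:

from sklearn.metrics import brier_score_loss

# Lower is better: 0 is perfect, 0.25 matches always predicting 0.5.
print(f"{brier_score_loss(y_test, probs):.3f}")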

models_bscore.png

Fixing it: Calibrating a classifier

  • Build another model that maps the classifier's scores to better probabilities!
  • A 1d model! (or more dimensions for multi-class)

\[ f_{\text{calib}}(s(x)) \approx p(y=1 \mid x)\]

  • \(s(x)\) is usually the score (or probability) given by the model
  • This even works with models that don't provide probabilities at all! We need a model for \(f_{\text{calib}}\), and we need to decide what data to train it on:
  • Training on the training set \(\to\) overfits
  • Training with cross-validation \(\to\) uses the data better, but is slower

Platt Scaling

  • Use a logistic sigmoid for \(f_{\text{calib}}\)
  • Basically learning a 1d logistic regression
  • (+ some tricks)
  • Works well for SVMs

\[f_{\text{platt}}(s(x)) = \frac{1}{1 + \exp(-w\,s(x))}\]
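A minimal sketch of the idea (not sklearn's internals, which include the target-smoothing tricks); scores_val and scores_test are assumed to hold the model's scores on a validation and a test set:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Platt scaling sketch: 1d logistic regression from scores to labels.
platt = LogisticRegression()
platt.fit(np.asarray(scores_val).reshape(-1, 1), y_val)

# Calibrated probabilities for new scores:
calibrated = platt.predict_proba(np.asarray(scores_test).reshape(-1, 1))[:, 1]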

Isotonic Regression

  • A very flexible way to specify \(f_{\text{calib}}\)
  • Learns an arbitrary monotonically increasing step function in 1d.
  • Groups the data into constant segments, with steps in between.
  • Finds the optimal monotone function on the training data (w.r.t. mean squared error).

isotonic_regression.png
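sklearn exposes this as IsotonicRegression; a minimal sketch on held-out scores (scores_val, y_val, scores_test assumed as above):

from sklearn.isotonic import IsotonicRegression

# Monotone step function from held-out scores to probabilities;
# out_of_bounds='clip' keeps unseen scores inside the fitted range.
iso = IsotonicRegression(out_of_bounds='clip')
iso.fit(scores_val, y_val)
calibrated = iso.predict(scores_test)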

Building the model

  • Using the training set to fit the calibration model is bad (overfits)
  • Either use a hold-out set or cross-validation
  • Cross-validation can be used, as in stacking, to make unbiased probability predictions; use those as the training set.

CalibratedClassifierCV

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV

# Hold out a validation set for calibration
X_train_sub, X_val, y_train_sub, y_val = \
    train_test_split(X_train, y_train,
                     stratify=y_train, random_state=0)
rf = RandomForestClassifier(n_estimators=100).fit(X_train_sub, y_train_sub)
scores = rf.predict_proba(X_test)[:, 1]
plot_calibration_curve(y_test, scores, n_bins=20)  # plotting helper from the lecture

random_forest.png

Calibration on Random Forest

# Platt scaling (method='sigmoid') on the held-out validation set
cal_rf = CalibratedClassifierCV(rf, cv="prefit",
                                method='sigmoid')
cal_rf.fit(X_val, y_val)
scores_sigm = cal_rf.predict_proba(X_test)[:, 1]

# Isotonic regression on the same validation set
cal_rf_iso = CalibratedClassifierCV(rf, cv="prefit",
                                    method='isotonic')
cal_rf_iso.fit(X_val, y_val)
scores_iso = cal_rf_iso.predict_proba(X_test)[:, 1]

types_callib.png

Cross-validated Calibration

# Without cv="prefit", CalibratedClassifierCV re-fits and calibrates
# the classifier per CV fold, using all of X_train.
cal_rf_iso_cv = CalibratedClassifierCV(rf, method='isotonic')
cal_rf_iso_cv.fit(X_train, y_train)
scores_iso_cv = cal_rf_iso_cv.predict_proba(X_test)[:, 1]

types_callib_cv.png

Multi-Class Calibration

multi_class_calibration.png

Class Imbalance

Two sources of imbalance

  • Asymmetric cost
  • Asymmetric data

Why do we care?

  • Why should the costs of errors be symmetric?
  • Virtually all real data is imbalanced to some degree
  • We often want to detect rare events

Changing Thresholds

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

data = load_breast_cancer()
lr = LogisticRegression(solver='lbfgs', max_iter=10000)
X_train, X_test, y_train, y_test = \
    train_test_split(data.data, data.target,
                     stratify=data.target, random_state=0)

lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.91      0.92      0.92        53
           1       0.96      0.94      0.95        90

    accuracy                           0.94       143
   macro avg       0.93      0.93      0.93       143
weighted avg       0.94      0.94      0.94       143
# Only predict class 1 when its predicted probability exceeds 0.85
y_pred = lr.predict_proba(X_test)[:, 1] > .85

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.84      1.00      0.91        53
           1       1.00      0.89      0.94        90

    accuracy                           0.93       143
   macro avg       0.92      0.94      0.93       143
weighted avg       0.94      0.93      0.93       143

ROC Curve

roc_svc_rf_curve.png

Remedies for the Model

Mammography Data

from sklearn.datasets import fetch_openml
data = fetch_openml('mammography', as_frame=True)
X, y = data.data, data.target
print(X.shape)
(11183, 6)
print(y.value_counts())
-1    10923
1       260
Name: class, dtype: int64
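The snippets below assume an integer-coded target and a stratified train/test split; a sketch of that (assumed) preprocessing:

from sklearn.model_selection import train_test_split

# Recode the string target {'-1', '1'} to {0, 1} (needed for np.bincount)
y = (y == '1').astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)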

mammography_data.png

Mammography Data

from sklearn.model_selection import cross_validate

lr = LogisticRegression(solver='lbfgs')
scores = cross_validate(lr, X_train, y_train, cv=10,
                        scoring=('roc_auc', 'average_precision'))
roc = scores['test_roc_auc'].mean()
avep = scores['test_average_precision'].mean()
print(f"{roc:.3f}, {avep:.3f}")
0.918, 0.631
rf = RandomForestClassifier(n_estimators=100)
scores = cross_validate(rf, X_train, y_train, cv=10,
                        scoring=('roc_auc', 'average_precision'))
roc = scores['test_roc_auc'].mean()
avep = scores['test_average_precision'].mean()
print(f"{roc:.3f}, {avep:.3f}")
0.943, 0.738

Mammography Data

mammography_features23.png

Basic Approaches

basic_approaches.png

Change the training procedure

Change the Data: Sampling

Scikit-learn vs. resampling

pipeline.png

imbalanced-learn

http://imbalanced-learn.org

!pip install imbalanced-learn
# or conda install ...

Extends sklearn API

Sampler

  • To resample a data set, each sampler implements fitting and resampling in one step:
data_resampled, targets_resampled = \
    obj.fit_resample(data, targets)
  • In pipelines: resampling is only applied in fit (i.e., only to the training data)!

Random Undersampling

from imblearn.under_sampling import RandomUnderSampler

# Drop random majority-class samples until the classes are balanced
rus = RandomUnderSampler(replacement=False)
X_train_subsample, y_train_subsample = \
    rus.fit_resample(X_train, y_train)
print(X_train.shape)
print(X_train_subsample.shape)
print(np.bincount(y_train_subsample))
(8387, 6)
(392, 6)
[196 196]

Random Undersampling

from imblearn.pipeline import make_pipeline as make_imb_pipeline
undersample_pipe = make_imb_pipeline(
    RandomUnderSampler(random_state=57), lr)
scores = cross_validate(undersample_pipe,
                        X_train, y_train, cv=10,
                        scoring=('roc_auc', 'average_precision'))
roc = scores['test_roc_auc'].mean()
avep = scores['test_average_precision'].mean()
print(f"{roc:.3f}, {avep:.3f}")
# was 0.918, 0.631 without
0.921, 0.580
undersample_pipe_rf = \
    make_imb_pipeline(RandomUnderSampler(random_state=0),
                      RandomForestClassifier(n_estimators=100))
scores = cross_validate(undersample_pipe_rf,
                        X_train, y_train, cv=10, 
                        scoring=('roc_auc', 'average_precision'))
roc = scores['test_roc_auc'].mean()
avep = scores['test_average_precision'].mean()
print(f"{roc:.3f}, {avep:.3f}")
# was 0.943, 0.738 without
0.945, 0.594

Random Oversampling

from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=0)
X_train_oversample, y_train_oversample = \
    ros.fit_resample(X_train, y_train)
print(X_train.shape)
print(X_train_oversample.shape)
print(np.bincount(y_train_oversample))
(8387, 6)
(16382, 6)
[8191 8191]

Random Oversampling

oversample_pipe = make_imb_pipeline(
    RandomOverSampler(random_state=0), lr)

scores = cross_validate(oversample_pipe,
                        X_train, y_train, cv=10,
                        scoring=('roc_auc', 'average_precision'))
roc = scores['test_roc_auc'].mean()
avep = scores['test_average_precision'].mean()
print(f"{roc:.3f}, {avep:.3f}")
# was 0.918, 0.631 without
0.925, 0.570
oversample_pipe_rf = \
    make_imb_pipeline(RandomOverSampler(random_state=0),
                      RandomForestClassifier(n_estimators=100))
scores = cross_validate(oversample_pipe_rf,
                        X_train, y_train, cv=10,
                        scoring=('roc_auc', 'average_precision'))
roc = scores['test_roc_auc'].mean()
avep = scores['test_average_precision'].mean()
print(f"{roc:.3f}, {avep:.3f}")
# was 0.943, 0.738 without
0.923, 0.708

Curves for LogReg

curves_logreg.png

Curves for Random Forest

curves_rf.png

ROC or PR?

FPR or Precision?

\[ \large\text{FPR} = \frac{\text{FP}}{\text{FP}+\text{TN}}\]

\[ \large\text{Precision} = \frac{\text{TP}}{\text{TP}+\text{FP}}\]
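The denominators are the difference: with many true negatives, FPR barely moves, while precision still reacts to every false positive. A sketch from a confusion matrix (assuming binary y_test, y_pred):

from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
fpr = fp / (fp + tn)        # diluted by a huge TN count
precision = tp / (tp + fp)  # independent of the TN count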

Change the Training

Class-weights

  • Instead of repeating samples, re-weight the loss function (see the sketch below).
  • Works for most models!
  • Same effect as over-sampling (though deterministic, not random), but cheaper: the dataset size stays the same.
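In sklearn this is the class_weight parameter; a minimal sketch with illustrative weights:

from sklearn.linear_model import LogisticRegression

# Count each minority (class 1) error 10x in the loss.
lr = LogisticRegression(class_weight={0: 1, 1: 10})
# class_weight='balanced' derives weights as
# n_samples / (n_classes * np.bincount(y)); most estimators also
# accept per-sample weights via fit(..., sample_weight=...).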

Class-weights in linear models

Unweighted: \[\min_{w \in \mathbb{R}^{p}} C \sum_{i=1}^n\log(\exp(-y_iw^T \vec{x}_i) + 1) + ||w||_2^2\] Class-weighted: \[ \min_{w \in \mathbb{R}^p} C \sum_{i=1}^n c_{y_i} \log(\exp(-y_i w^T \vec{x}_i) + 1) + ||w||^2_2 \] Similar for linear and non-linear SVMs.

Class weights in trees

Gini Index: \[H_\text{gini}(X_m) = \sum_{k\in\mathcal{Y}} p_{mk} (1 - p_{mk})\] Weighted: \[H_\text{gini}(X_m) = \sum_{k\in\mathcal{Y}} c_k\, p_{mk} (1 - p_{mk})\] Prediction: weighted vote. (Numeric sketch below.)
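A tiny numeric sketch of the weighted impurity (weights are illustrative):

import numpy as np

def weighted_gini(p, c):
    # p: class proportions in the node, c: class weights
    return np.sum(c * p * (1 - p))

print(weighted_gini(np.array([0.9, 0.1]), np.array([1.0, 1.0])))   # 0.18
print(weighted_gini(np.array([0.9, 0.1]), np.array([1.0, 10.0])))  # 0.99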

Using Class Weights

# class_weight='balanced': weights n_samples / (n_classes * n_samples_k)
lr = LogisticRegression(solver='lbfgs',
                        class_weight='balanced')
scores = cross_validate(lr, X_train, y_train, cv=10,
                        scoring=('roc_auc',
                                 'average_precision'))
roc = scores['test_roc_auc'].mean()
avep = scores['test_average_precision'].mean()
print(f"{roc:.3f}, {avep:.3f}")
0.923, 0.564
rf = RandomForestClassifier(n_estimators=100,
                            class_weight='balanced')
scores = cross_validate(rf, X_train, y_train, cv=10,
                        scoring=('roc_auc',
                                 'average_precision'))
roc = scores['test_roc_auc'].mean()
avep = scores['test_average_precision'].mean()
print(f"{roc:.3f}, {avep:.3f}")

>>> 0.915, 0.707

Ensemble Resampling

  • Random resampling done separately for each estimator in the ensemble!
  • Paper: "Exploratory Undersampling for Class Imbalance Learning" (Liu, Wu & Zhou)
  • Not in sklearn (yet)
  • Easy with imblearn

Easy Ensemble with imblearn

from sklearn.tree import DecisionTreeClassifier
from imblearn.ensemble import BalancedBaggingClassifier

# Bagged trees with max_features='sqrt' mimic a random forest,
# but each bootstrap sample is balanced by random undersampling.
tree = DecisionTreeClassifier(max_features='sqrt')
resampled_rf = \
    BalancedBaggingClassifier(base_estimator=tree,
                              n_estimators=100, random_state=0)
scores = cross_validate(resampled_rf, X_train, y_train,
                        cv=10,
                        scoring=('roc_auc', 'average_precision'))
roc = scores['test_roc_auc'].mean()
avep = scores['test_average_precision'].mean()
print(f"{roc:.3f}, {avep:.3f}")
0.949, 0.668

roc_vs_pr.png

Smart resampling

(based on nearest-neighbor heuristics from the 1970s)

Edited Nearest Neighbours

  • Originally a heuristic for reducing the dataset size for KNN
  • Removes samples that are misclassified by KNN (kind_sel="mode") or that have any neighbor from the other class (kind_sel="all").
  • "Cleans up" outliers and boundaries.

Edited Nearest Neighbours

edited_nearest_neighbour.png

Edited Nearest Neighbours

from imblearn.under_sampling import EditedNearestNeighbours
enn = EditedNearestNeighbours(n_neighbors=5)
X_train_enn, y_train_enn = enn.fit_resample(X_train, y_train)

enn_mode = EditedNearestNeighbours(kind_sel="mode", 
                                   n_neighbors=5)
X_train_enn_mode, y_train_enn_mode = \
    enn_mode.fit_resample(X_train, y_train)

edited_nearest_neighbour_2.png

from sklearn.model_selection import cross_val_score

enn_pipe = make_imb_pipeline(EditedNearestNeighbours(n_neighbors=5),
                             LogisticRegression(solver='lbfgs'))
scores = cross_val_score(enn_pipe, X_train, y_train, cv=10, scoring='roc_auc')
print(f"{np.mean(scores):.3f}")
0.921
enn_pipe_rf = make_imb_pipeline(EditedNearestNeighbours(n_neighbors=5),
                                  RandomForestClassifier(n_estimators=100))
scores = cross_val_score(enn_pipe_rf, X_train, y_train, cv=10, scoring='roc_auc')
print(f"{np.mean(scores):.3f}")
0.941

Condensed Nearest Neighbors

  • Iteratively adds the points that are misclassified by KNN on the current subset
  • Focuses on the boundaries
  • Usually removes many samples

from imblearn.under_sampling import CondensedNearestNeighbour
cnn = CondensedNearestNeighbour()
X_train_cnn, y_train_cnn = cnn.fit_resample(X_train, y_train)
print(X_train_cnn.shape)
print(np.bincount(y_train_cnn))
(553, 6)
[357 196]

edited_condensed_nn.png

cnn_pipe = make_imb_pipeline(CondensedNearestNeighbour(),
                             LogisticRegression(solver='lbfgs'))
scores = cross_val_score(cnn_pipe, X_train, y_train, cv=10, scoring='roc_auc')
print(f"{np.mean(scores):.3f}")
0.916
rf = RandomForestClassifier(n_estimators=100, random_state=0)
cnn_pipe = make_imb_pipeline(CondensedNearestNeighbour(), rf)
scores = cross_val_score(cnn_pipe, X_train, y_train, cv=10, scoring='roc_auc')
print(f"{np.mean(scores):.2f}")
0.93

Synthetic Sample Generation

Synthetic Minority Oversampling Technique (SMOTE)

  • Adds synthetic, interpolated samples to the smaller class
  • For each sample in the minority class (see the sketch after this list):
    • Pick a random neighbor from its k nearest minority neighbors.
    • Pick a point uniformly at random on the line connecting the two.
    • Repeat.
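A minimal sketch of this interpolation step (not imblearn's implementation; X_min is assumed to hold the minority-class rows as a NumPy array):

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: self included
_, idx = nn.kneighbors(X_min)

synthetic = []
for i in range(len(X_min)):
    j = rng.choice(idx[i, 1:])    # random neighbor, skipping the point itself
    lam = rng.uniform()           # uniform position on the connecting segment
    synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
synthetic = np.asarray(synthetic)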

smote.png

smote_3.png

from imblearn.over_sampling import SMOTE
smote_pipe = make_imb_pipeline(SMOTE(),
                               LogisticRegression(solver='lbfgs'))
scores = cross_val_score(smote_pipe, X_train, y_train,
                         cv=10, scoring='roc_auc')
print(f"{np.mean(scores):.3f}")
0.922
smote_pipe_rf = \
    make_imb_pipeline(SMOTE(),
                      RandomForestClassifier(n_estimators=100))
scores = cross_val_score(smote_pipe_rf,X_train,y_train,
                         cv=10, scoring='roc_auc')
print(f"{np.mean(scores):.3f}")
0.943
from sklearn.model_selection import GridSearchCV

# Tune SMOTE's number of neighbors via the pipeline step name 'smote'
param_grid = {'smote__k_neighbors': [3, 5, 7, 9, 11, 15, 31]}
search = GridSearchCV(smote_pipe_rf, param_grid,
                      cv=10, scoring="roc_auc")
search.fit(X_train, y_train)
print(f"{search.score(X_test, y_test):.3f}")
0.959

param_smote_k_neighbours.png

smote_k_neighbours.png