Regularization

09/30/2022

Robert Utterback (based on slides by Andreas Müller)
\( \renewcommand{\vec}[1]{\boldsymbol{#1}} \newcommand{\E}{\mathop{\boldsymbol{E}}} \newcommand{\var}{\boldsymbol{Var}} \newcommand{\norm}[1]{\lvert\lvert#1\rvert\rvert} \newcommand{\abs}[1]{\lvert#1\rvert} \newcommand{\ltwo}[1]{\norm{#1}_2} \newcommand{\lone}[1]{\norm{#1}_1} \newcommand{\sgn}[1]{\text{sign}\left( #1 \right)} \newcommand{\e}{\mathrm{e}} \newcommand{\minw}{\min_{w \in \mathbb{R}^p}} \newcommand{\sumn}{\sum_{i=1}^n} \newcommand{\logloss}{\log{(\exp{(-y_iw^T\vec{x}_i)} + 1)}} \)

Linear Regression Review

Linear Models for Regression

linear_regression_1d.png

\[ \hat{y} = w^T \vec{x} + b = \sum_{i=1}^p w_i x_i + b \]

Ordinary Least Squares

\[ \hat{y} = w^T \vec{x} + b = \sum_{i=1}^p w_i x_i + b\] \[ \min_{\vec{w} \in \mathbb{R}^p, b\in\mathbb{R}} \sum_{i=1}^m \norm{w^T \vec{x}^{(i)} + b - y^{(i)}}^2 \]

Unique solution if \(\vec{X} = (\vec{x}^{(1)},\ldots,\vec{x}^{(m)})^T\) has full column rank.
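
As a quick illustration, the unconstrained problem can be solved directly with NumPy's least-squares routine. This is only a sketch on synthetic data; the values below are made up.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # full column rank (with probability 1)
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

# append a column of ones so the intercept b is estimated along with w
X1 = np.hstack([X, np.ones((100, 1))])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
w, b = coef[:-1], coef[-1]
print(w, b)                            # close to the true (1, -2, 0.5) and 3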

Ridge Regression

Ridge Regression

\[ \min_{\vec{w} \in \mathbb{R}^p, b\in\mathbb{R}} \sum_{i=1}^m \norm{w^T \vec{x}^{(i)} + b - y^{(i)}}^2 + \alpha \ltwo{w}^2 \]

  • Always has a unique solution (see the sketch below)
  • \(\alpha\) is a tuning parameter.
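
Why the solution is always unique: for \(\alpha > 0\), \(X^TX + \alpha I\) is invertible even when \(X\) does not have full column rank. A small sketch on synthetic data (intercept omitted for brevity):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))    # more features than samples: X^T X is singular
y = rng.normal(size=5)
alpha = 1.0

# ridge closed form: adding alpha * I makes the normal equations solvable
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(10), X.T @ y)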

Regularized Empirical Risk Minimization

\[ \min_{f \in F} \sum_{i=1}^m L(f(\vec{x}^{(i)}),y^{(i)}) + \alpha R(f) \]

Reminder on model complexity

overfitting_underfitting_cartoon_full.png

Ames Housing Dataset

ames_housing_scatter.png

print(X.shape, y.shape)
(1460, 80) (1460,)
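
One way to obtain data of this shape is from OpenML. This is a sketch and assumes the Ames housing data is published there under the name 'house_prices':

from sklearn.datasets import fetch_openml

ames = fetch_openml(name='house_prices', as_frame=True)   # Ames housing (assumed OpenML name)
X, y = ames.data, ames.target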

ames_scaling.png

Coefficient of determination

\[ R^2(y,\hat{y}) = 1 - \frac{\sum_{i=1}^{m} (y^{(i)} - \hat{y}^{(i)})^2}{\sum_{i=1}^{m} (y^{(i)} - \overline{y})^2 } \] \[ \overline{y} = \frac1m \sum_{i=1}^{m} y^{(i)} \] Can be negative for biased estimators - or for the test set!
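
A quick check of the formula against scikit-learn's r2_score, including a prediction bad enough to push \(R^2\) below zero (numbers made up for illustration):

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.5, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(1 - ss_res / ss_tot, r2_score(y_true, y_pred))   # identical values

print(r2_score(y_true, np.zeros(4)))   # constant prediction far from the data: negative R^2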

Preprocessing

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
cat_pre = make_pipeline( # standardize missing values, then one-hot encode
    SimpleImputer(strategy='constant', fill_value='NA'),
    OneHotEncoder(handle_unknown='ignore'))
num_pre = make_pipeline(SimpleImputer(), StandardScaler())

from sklearn.compose import make_column_selector, \
    make_column_transformer
full_pre = make_column_transformer(
    (cat_pre, make_column_selector(dtype_include='object')),
    remainder=num_pre)

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, random_state=2)
pipe = make_pipeline(full_pre, LinearRegression())
print(cross_val_score(pipe, X_train, y_train, cv=5))
[0.818 0.885 0.892 0.905 0.887]

Note on Skewed Targets


y_train.hist(bins='auto')

ames_target_skewed.png


np.log(y_train).hist(bins='auto')

ames_target_log.png

Handling Transformed Targets

cross_val_score(pipe, X_train, y_train, cv=5)
0.818 0.885 0.892 0.905 0.887
from sklearn.compose import TransformedTargetRegressor
log_linreg = TransformedTargetRegressor(
    LinearRegression(), func=np.log, inverse_func=np.exp)
reg_pipe = make_pipeline(full_pre, log_linreg)
cross_val_score(reg_pipe, X_train, y_train, cv=5)
0.882 0.896 0.905 0.902 0.911

Linear Regression vs. Ridge

from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
cross_val_score(reg_pipe, X_train, y_train, cv=5)
0.882 0.896 0.905 0.902 0.911
log_ridge = TransformedTargetRegressor(
    Ridge(), func=np.log, inverse_func=np.exp)
ridge_pipe = make_pipeline(full_pre, log_ridge)
cross_val_score(ridge_pipe, X_train, y_train, cv=5)
0.898 0.911 0.942 0.909 0.914

Tuning Ridge Regression

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
pipe = Pipeline([('pre', full_pre), ('ridge', log_ridge)])
param_grid = {'ridge__regressor__alpha': np.logspace(-3, 3, 30)}
grid = GridSearchCV(pipe, param_grid, return_train_score=True)
grid.fit(X_train, y_train)
print(f"{grid.best_score_:3f}")
0.924953

ames_ridge_gridsearch.png

Triazine Dataset

from sklearn.datasets import fetch_openml
triazines = fetch_openml('triazines', version=1)
print(triazines.data.shape)
(186, 60)

pd.Series(triazines.target).hist()

triazine-target.png

X_train, X_test, y_train, y_test = \
    train_test_split(triazines.data, triazines.target,
                     random_state=0)
print(cross_val_score(LinearRegression(), X_train, y_train, cv=5))
print(cross_val_score(Ridge(), X_train, y_train, cv=5))
[ 0.213  0.129 -0.796 -0.222 -0.155]
[0.263 0.455 0.024 0.23  0.036]

param_grid = {'alpha': np.logspace(-3,3,30)}
from sklearn.model_selection import RepeatedKFold
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
grid = GridSearchCV(Ridge(), param_grid,
                    cv=cv, return_train_score=True)
grid.fit(X_train, y_train)

triazine-ridge-grid.png

Plotting coefficient values for LR


lr = LinearRegression().fit(X_train, y_train)
plt.scatter(range(X_train.shape[1]), lr.coef_,
            c=np.sign(lr.coef_), cmap="bwr_r")
plt.xlabel('feature index'); plt.ylabel('regression coefficient')

triazine-lr-coef.png

Ridge Coefficients


ridge = grid.best_estimator_
plt.scatter(range(X_train.shape[1]), ridge.coef_,
            c=np.sign(ridge.coef_), cmap="bwr_r")
plt.xlabel('feature index'); plt.ylabel('regression coefficient')

triazine-ridge-coef.png

Boston LR Coefficients

lr_coefficients_large.png

Ridge Coefficients By alpha


ridge100 = Ridge(alpha=100).fit(X_train, y_train)
ridge1 = Ridge(alpha=1).fit(X_train, y_train)
plt.figure(figsize=(8, 4))
plt.plot(ridge1.coef_, 'o', label="alpha=1")
plt.plot(ridge.coef_, 'o', label=f"alpha={ridge.alpha:.2f}")
plt.plot(ridge100.coef_, 'o', label="alpha=100")
plt.legend()

triazine-ridge-coefs-by-alpha.png

Learning Curves

ridge_learning_curve.png

Lasso Regression

Lasso Regression

\[ \min_{w \in \mathbb{R}^p, b\in\mathbb{R}} \sum_{i=1}^m \norm{w^T \vec{x}^{(i)} +b - y^{(i)}}^2 + \alpha \lone{w} \]

  • Shrinks \(\vec{w}\) towards zero, like ridge
  • Sets some entries of \(\vec{w}\) exactly to zero
  • Automatic feature selection (sketched below)!
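
A minimal sketch of the feature-selection effect on synthetic data where only two features matter (all values here are made up):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 3 * X[:, 0] - 2 * X[:, 5] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(np.sum(lasso.coef_ != 0))    # only a few coefficients remain nonzero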

Grid-Search for Lasso

from sklearn.linear_model import Lasso
param_grid = {'alpha': np.logspace(-5, 0, 20)}
grid = GridSearchCV(Lasso(max_iter=10000), param_grid, cv=10)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(f"{grid.best_score_:.3f}")
{'alpha': 0.0012742749857031334}
0.169

Grid-Search for Lasso

lasso_alpha_triazine.png

Coefficients for Lasso

lasso_coefficients.png

lasso = grid.best_estimator_
print(X_train.shape)
print(np.sum(lasso.coef_ != 0))
(139, 60)
14

Understanding Penalties

Understanding L1 and L2 Penalties

l2_l1_l0.png

\(\ell_2(w) = \sum_i w_i^2\)

\(\ell_1(w) = \sum_i \abs{w_i}\)

\(\ell_0(w) = \sum_i \mathbf{1}\{w_i \ne 0\}\)
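
The three penalties evaluated on a concrete weight vector (illustrative values):

import numpy as np

w = np.array([0.0, 0.5, -2.0, 0.0, 1.5])
print(np.sum(w ** 2),      # L2 (ridge) penalty: 6.5
      np.sum(np.abs(w)),   # L1 (lasso) penalty: 4.0
      np.sum(w != 0))      # L0: number of nonzero weights, 3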

Understanding L1 and L2 Penalties

l1_kink.png

\[ f(x) = (2x-1)^2 \]

\[ f(x) + L2 = (2x-1)^2 + \alpha x^2 \]

\[ f(x) + L1 = (2x-1)^2 + \alpha \abs{x} \]
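
A one-dimensional numeric sketch of the difference (\(\alpha = 5\) chosen here just to make the contrast visible): the L2 penalty shrinks the minimizer towards zero, while the kink of the L1 penalty pulls it exactly to zero.

import numpy as np

x = np.arange(-1000, 1001) / 1000.0   # grid on [-1, 1] that contains 0 exactly
f = (2 * x - 1) ** 2
alpha = 5.0

print(x[np.argmin(f + alpha * x ** 2)])     # ~0.22: shrunk, but not zero
print(x[np.argmin(f + alpha * np.abs(x))])  # 0.0: exactly zero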

Understanding L1 and L2 Penalties

l1l2ball.png

Understanding L1 and L2 Penalties

l1l2ball_intersect.png

Elastic Net

Elastic Net

\[ \min_{\vec{w} \in \mathbb{R}^p, b\in\mathbb{R}} \sum_{i=1}^m \norm{w^T \vec{x}^{(i)} + b - y^{(i)}}^2 + \alpha_1 \lone{w} + \alpha_2 \ltwo{w}^2 \]

  • Combines the benefits of ridge and lasso
  • Must tune two parameters

Comparing Unit Norm Contours

l1l2_elasticnet.png

Parameterization in scikit-learn

\[ \min_{\vec{w} \in \mathbb{R}^p, b\in\mathbb{R}} \sum_{i=1}^m \norm{w^T \vec{x}^{(i)} + b - y^{(i)}}^2 + \alpha\eta \lone{w} + \alpha(1-\eta) \ltwo{w}^2 \]

  • Where \(\eta\) is the relative amount of L1 penalty (see the mapping sketched below).
  • In sklearn: l1_ratio
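
Mapping separate penalty weights \(\alpha_1\) (L1) and \(\alpha_2\) (L2) onto this parameterization, as a sketch (the values are made up; sklearn's ElasticNet also rescales the loss internally, which is ignored here):

from sklearn.linear_model import ElasticNet

alpha1, alpha2 = 0.75, 0.25                 # hypothetical L1 and L2 weights
enet = ElasticNet(alpha=alpha1 + alpha2,                # total regularization strength
                  l1_ratio=alpha1 / (alpha1 + alpha2))  # relative amount of L1 (eta)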

Grid-Search for Elastic Net

from sklearn.linear_model import ElasticNet
param_grid = {'alpha': np.logspace(-4, -1, 10),
              'l1_ratio': [0.01, .1, .5, .8, .9, .95, .98, 1]}
grid = GridSearchCV(ElasticNet(), param_grid, cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
print((grid.best_estimator_.coef_ != 0).sum())
{'alpha': 0.002154434690031882, 'l1_ratio': 0.5}
0.17410151837759943
16

Analyzing 2D Grid Search


import seaborn as sns
table = pd.pivot_table(pd.DataFrame(grid.cv_results_),
    values='mean_test_score', index='param_alpha',
    columns='param_l1_ratio')
ax = sns.heatmap(table, annot=True, fmt=".2g")
ax.set_yticklabels([round(x, 4) for x in table.index])
ax.collections[0].colorbar.set_label(r'$R^2$', rotation=0)
plt.gcf().tight_layout()

triazine-elasticnet-gridsearch.png


Robust Regression

Robust Regression

robust_regression.png

Least Squares Fit to Outlier Data

outliers_least_squares.png

Robust Fit

outliers_robust_fit.png

Huber Loss

huber_loss.png

\[ \min_{\vec{w},\sigma} \sum_{i=1}^m \left( \sigma + H\left( \frac{w^T\vec{x}^{(i)} - y^{(i)}}{\sigma} \right) \sigma \right) + \alpha \ltwo{w}^2 \]

\[ H(z) = \begin{cases} z^2, & \qquad\text{if $\abs{z} < \epsilon$}\\ 2\epsilon \abs{z} - \epsilon^2, & \qquad\text{otherwise} \end{cases} \]
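
This loss is implemented by scikit-learn's HuberRegressor; a minimal sketch on synthetic data with a few gross outliers (all values made up):

import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 2 * X.ravel() + 1 + rng.normal(scale=0.3, size=100)
y[:5] += 20                                      # a handful of gross outliers

huber = HuberRegressor(epsilon=1.35).fit(X, y)   # epsilon sets the quadratic region
ols = LinearRegression().fit(X, y)
print(huber.intercept_, ols.intercept_)          # Huber stays near the true 1; OLS is pulled up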

RANSAC

ransac_algorithm.png

RANSAC

ransac_algorithm2.png
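
scikit-learn provides this as RANSACRegressor (by default it wraps an ordinary linear regression); a minimal sketch on the same kind of contaminated data as above:

import numpy as np
from sklearn.linear_model import RANSACRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 2 * X.ravel() + 1 + rng.normal(scale=0.3, size=100)
y[:10] += 20                              # gross outliers

ransac = RANSACRegressor(random_state=0).fit(X, y)
print(ransac.estimator_.coef_)            # fitted on the inlier consensus set
print(ransac.inlier_mask_.sum())          # number of points judged inliers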