Regularization

09/30/2022

Robert Utterback (based on slides by Andreas Müller)
\( \renewcommand{\vec}[1]{\boldsymbol{#1}} \newcommand{\E}{\mathop{\boldsymbol{E}}} \newcommand{\var}{\boldsymbol{Var}} \newcommand{\norm}[1]{\lvert\lvert#1\rvert\rvert} \newcommand{\abs}[1]{\lvert#1\rvert} \newcommand{\ltwo}[1]{\norm{#1}_2} \newcommand{\lone}[1]{\norm{#1}_1} \newcommand{\sgn}[1]{\text{sign}\left( #1 \right)} \newcommand{\e}{\mathrm{e}} \newcommand{\minw}{\min_{w \in \mathbb{R}^p}} \newcommand{\sumn}{\sum_{i=1}^n} \newcommand{\logloss}{\log{(\exp{(-y_iw^T\vec{x}_i)} + 1)}} \)

Linear Regression Review

Linear Models for Regression

linear_regression_1d.png

\[ \hat{y} = w^T \vec{x} + b = \sum_{i=1}^p w_i x_i + b \]

Ordinary Least Squares

\[ \hat{y} = w^T \vec{x} + b = \sum_{i=1}^p w_i x_i + b\] \[ \min_{\vec{w} \in \mathbb{R}^p, b\in\mathbb{R}} \sum_{i=1}^m \norm{w^T \vec{x}^{(i)} + b - y^{(i)}}^2 \]

Unique solution if \(\vec{X} = (\vec{x}^{(1)},\ldots,\vec{x}^{(m)})^T\) has full column rank.
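
As a quick illustration, the unconstrained problem can be solved directly with NumPy's least-squares routine. This is only a sketch on synthetic data; the values below are made up.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # full column rank (with probability 1)
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

# append a column of ones so the intercept b is estimated along with w
X1 = np.hstack([X, np.ones((100, 1))])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
w, b = coef[:-1], coef[-1]
print(w, b)                            # close to the true (1, -2, 0.5) and 3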

Ridge Regression

Ridge Regression

\[ \min_{\vec{w} \in \mathbb{R}^p, b\in\mathbb{R}} \sum_{i=1}^m \norm{w^T \vec{x}^{(i)} + b - y^{(i)}}^2 + \alpha \ltwo{w}^2 \]

  • Always has a unique solution (see the sketch below)
  • \(\alpha\) is a tuning parameter.
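
Why the solution is always unique: for \(\alpha > 0\), \(X^TX + \alpha I\) is invertible even when \(X\) does not have full column rank. A small sketch on synthetic data (intercept omitted for brevity):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))    # more features than samples: X^T X is singular
y = rng.normal(size=5)
alpha = 1.0

# ridge closed form: adding alpha * I makes the normal equations solvable
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(10), X.T @ y)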

Regularized Empirical Risk Minimization

\[ \min_{f \in F} \sum_{i=1}^m L(f(\vec{x}^{(i)}),y^{(i)}) + \alpha R(f) \]

Reminder on model complexity

overfitting_underfitting_cartoon_full.png

Ames Housing Dataset

ames_housing_scatter.png

print(X.shape, y.shape)
(1460, 80) (1460,)
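
One way to obtain data of this shape is from OpenML. This is a sketch and assumes the Ames housing data is published there under the name 'house_prices':

from sklearn.datasets import fetch_openml

ames = fetch_openml(name='house_prices', as_frame=True)   # Ames housing (assumed OpenML name)
X, y = ames.data, ames.target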

ames_scaling.png

Coefficient of determination

\[ R^2(y,\hat{y}) = 1 - \frac{\sum_{i=1}^{m} (y^{(i)} - \hat{y}^{(i)})^2}{\sum_{i=1}^{m} (y^{(i)} - \overline{y})^2 } \] \[ \overline{y} = \frac1m \sum_{i=1}^{m} y^{(i)} \] Can be negative for biased estimators - or for the test set!
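
A quick check of the formula against scikit-learn's r2_score, including a prediction bad enough to push \(R^2\) below zero (numbers made up for illustration):

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.5, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(1 - ss_res / ss_tot, r2_score(y_true, y_pred))   # identical values

print(r2_score(y_true, np.zeros(4)))   # constant prediction far from the data: negative R^2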

Preprocessing

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
cat_pre = make_pipeline( # standardize missing values, then one-hot encode
    SimpleImputer(strategy='constant', fill_value='NA'),
    OneHotEncoder(handle_unknown='ignore'))
num_pre = make_pipeline(SimpleImputer(), StandardScaler())

from sklearn.compose import make_column_selector, \
    make_column_transformer
full_pre = make_column_transformer(
    (cat_pre, make_column_selector(dtype_include='object')),
    remainder=num_pre)

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, random_state=2)
pipe = make_pipeline(full_pre, LinearRegression())
print(cross_val_score(pipe, X_train, y_train, cv=5))
[0.818 0.885 0.892 0.905 0.887]

Note on Skewed Targets


y_train.hist(bins='auto')

ames_target_skewed.png


np.log(y_train).hist(bins='auto')

ames_target_log.png

Handling Transformed Targets

cross_val_score(pipe, X_train, y_train, cv=5)
0.818 0.885 0.892 0.905 0.887
from sklearn.compose import TransformedTargetRegressor
log_linreg = TransformedTargetRegressor(
    LinearRegression(), func=np.log, inverse_func=np.exp)
reg_pipe = make_pipeline(full_pre, log_linreg)
cross_val_score(reg_pipe, X_train, y_train, cv=5)
0.882 0.896 0.905 0.902 0.911

Linear Regression vs. Ridge

from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
cross_val_score(reg_pipe, X_train, y_train, cv=5)
0.882 0.896 0.905 0.902 0.911
log_ridge = TransformedTargetRegressor(
    Ridge(), func=np.log, inverse_func=np.exp)
ridge_pipe = make_pipeline(full_pre, log_ridge)
cross_val_score(ridge_pipe, X_train, y_train, cv=5)
0.898 0.911 0.942 0.909 0.914

Tuning Ridge Regression

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
pipe = Pipeline([('pre', full_pre), ('ridge', log_ridge)])
param_grid = {'ridge__regressor__alpha': np.logspace(-3, 3, 30)}
grid = GridSearchCV(pipe, param_grid, return_train_score=True)
grid.fit(X_train, y_train)
print(f"{grid.best_score_:3f}")
0.924953

ames_ridge_gridsearch.png

Triazine Dataset

from sklearn.datasets import fetch_openml
triazines = fetch_openml('triazines', version=1)
print(triazines.data.shape)
(186, 60)

pd.Series(triazines.target).hist()

triazine-target.png

X_train, X_test, y_train, y_test = \
    train_test_split(triazines.data, triazines.target,
                     random_state=0)
print(cross_val_score(LinearRegression(), X_train, y_train, cv=5))
print(cross_val_score(Ridge(), X_train, y_train, cv=5))
[ 0.213  0.129 -0.796 -0.222 -0.155]
[0.263 0.455 0.024 0.23  0.036]

param_grid = {'alpha': np.logspace(-3,3,30)}
from sklearn.model_selection import RepeatedKFold
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
grid = GridSearchCV(Ridge(), param_grid,
                    cv=cv, return_train_score=True)
grid.fit(X_train, y_train)

triazine-ridge-grid.png

Plotting coefficient values for LR


lr = LinearRegression().fit(X_train, y_train)
plt.scatter(range(X_train.shape[1]), lr.coef_,
            c=np.sign(lr.coef_), cmap="bwr_r")
plt.xlabel('feature index'); plt.ylabel('regression coefficient')

triazine-lr-coef.png

Ridge Coefficients


ridge = grid.best_estimator_
plt.scatter(range(X_train.shape[1]), ridge.coef_,
            c=np.sign(ridge.coef_), cmap="bwr_r")
plt.xlabel('feature index'); plt.ylabel('regression coefficient')

triazine-ridge-coef.png

Boston LR Coefficients

lr_coefficients_large.png

Ridge Coefficients By alpha


ridge100 = Ridge(alpha=100).fit(X_train, y_train)
ridge1 = Ridge(alpha=1).fit(X_train, y_train)
plt.figure(figsize=(8, 4))
plt.plot(ridge1.coef_, 'o', label="alpha=1")
plt.plot(ridge.coef_, 'o', label=f"alpha={ridge.alpha:.2f}")
plt.plot(ridge100.coef_, 'o', label="alpha=100")
plt.legend()

triazine-ridge-coefs-by-alpha.png

Learning Curves

ridge_learning_curve.png

Lasso Regression

Lasso Regression

\[ \min_{w \in \mathbb{R}^p, b\in\mathbb{R}} \sum_{i=1}^m \norm{w^T \vec{x}^{(i)} +b - y^{(i)}}^2 + \alpha \lone{w} \]

  • Shrinks \(\vec{w}\) towards zero, like ridge
  • Sets some entries of \(\vec{w}\) exactly to zero
  • Automatic feature selection (sketched below)!
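
A minimal sketch of the feature-selection effect on synthetic data where only two features matter (all values here are made up):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 3 * X[:, 0] - 2 * X[:, 5] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(np.sum(lasso.coef_ != 0))    # only a few coefficients remain nonzero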

Grid-Search for Lasso

from sklearn.linear_model import Lasso
param_grid = {'alpha': np.logspace(-5, 0, 20)}
grid = GridSearchCV(Lasso(max_iter=10000), param_grid, cv=10)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(f"{grid.best_score_:.3f}")
{'alpha': 0.0012742749857031334}
0.169

Grid-Search for Lasso

lasso_alpha_triazine.png

Coefficients for Lasso

lasso_coefficients.png

lasso = grid.best_estimator_
print(X_train.shape)
print(np.sum(lasso.coef_ != 0))
(139, 60)
14

Understanding Penalties

Understanding L1 and L2 Penalties

l2_l1_l0.png

\(\ell_2(w) = \sum_i w_i^2\)

\(\ell_1(w) = \sum_i \abs{w_i}\)

\(\ell_0(w) = \sum_i \mathbf{1}\{w_i \ne 0\}\)
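
The three penalties evaluated on a concrete weight vector (illustrative values):

import numpy as np

w = np.array([0.0, 0.5, -2.0, 0.0, 1.5])
print(np.sum(w ** 2),      # L2 (ridge) penalty: 6.5
      np.sum(np.abs(w)),   # L1 (lasso) penalty: 4.0
      np.sum(w != 0))      # L0: number of nonzero weights, 3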

Understanding L1 and L2 Penalties

l1_kink.png

\[ f(x) = (2x-1)^2 \]

\[ f(x) + L2 = (2x-1)^2 + \alpha x^2 \]

\[ f(x) + L1 = (2x-1)^2 + \alpha \abs{x} \]
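
A one-dimensional numeric sketch of the difference (\(\alpha = 5\) chosen here just to make the contrast visible): the L2 penalty shrinks the minimizer towards zero, while the kink of the L1 penalty pulls it exactly to zero.

import numpy as np

x = np.arange(-1000, 1001) / 1000.0   # grid on [-1, 1] that contains 0 exactly
f = (2 * x - 1) ** 2
alpha = 5.0

print(x[np.argmin(f + alpha * x ** 2)])     # ~0.22: shrunk, but not zero
print(x[np.argmin(f + alpha * np.abs(x))])  # 0.0: exactly zero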

Understanding L1 and L2 Penalties

l1l2ball.png

Understanding L1 and L2 Penalties

l1l2ball_intersect.png

Elastic Net

Elastic Net

\[ \min_{\vec{w} \in \mathbb{R}^p, b\in\mathbb{R}} \sum_{i=1}^m \norm{w^T \vec{x}^{(i)} + b - y^{(i)}}^2 + \alpha_1 \lone{w} + \alpha_2 \ltwo{w}^2 \]

  • Combines the benefits of ridge and lasso
  • Must tune two parameters

Comparing Unit Norm Contours

l1l2_elasticnet.png

Parameterization in scikit-learn

\[ \min_{\vec{w} \in \mathbb{R}^p, b\in\mathbb{R}} \sum_{i=1}^m \norm{w^T \vec{x}^{(i)} + b - y^{(i)}}^2 + \alpha\eta \lone{w} + \alpha(1-\eta) \ltwo{w}^2 \]

  • Where \(\eta\) is the relative amount of L1 penalty (see the mapping sketched below).
  • In sklearn: l1_ratio
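
Mapping separate penalty weights \(\alpha_1\) (L1) and \(\alpha_2\) (L2) onto this parameterization, as a sketch (the values are made up; sklearn's ElasticNet also rescales the loss internally, which is ignored here):

from sklearn.linear_model import ElasticNet

alpha1, alpha2 = 0.75, 0.25                 # hypothetical L1 and L2 weights
enet = ElasticNet(alpha=alpha1 + alpha2,                # total regularization strength
                  l1_ratio=alpha1 / (alpha1 + alpha2))  # relative amount of L1 (eta)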

Grid-Search for Elastic Net

from sklearn.linear_model import ElasticNet
param_grid = {'alpha': np.logspace(-4, -1, 10),
              'l1_ratio': [0.01, .1, .5, .8, .9, .95, .98, 1]}
grid = GridSearchCV(ElasticNet(), param_grid, cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
print((grid.best_estimator_.coef_ != 0).sum())
{'alpha': 0.002154434690031882, 'l1_ratio': 0.5}
0.17410151837759943
16

Analyzing 2D Grid Search


import seaborn as sns
table = pd.pivot_table(pd.DataFrame(grid.cv_results_),
    values='mean_test_score', index='param_alpha',
    columns='param_l1_ratio')
ax = sns.heatmap(table, annot=True, fmt=".2g")
ax.set_yticklabels([round(x, 4) for x in table.index])
ax.collections[0].colorbar.set_label(r'$R^2$', rotation=0)
plt.gcf().tight_layout()

triazine-elasticnet-gridsearch.png


Robust Regression

Robust Regression

robust_regression.png

Least Squares Fit to Outlier Data

outliers_least_squares.png

Robust Fit

outliers_robust_fit.png

Huber Loss

huber_loss.png

\[ \min_{\vec{w},\sigma} \sum_{i=1}^m \left( \sigma + H\left( \frac{w^T\vec{x}^{(i)} - y^{(i)}}{\sigma} \right) \sigma \right) + \alpha \ltwo{w}^2 \]

\[ H(z) = \begin{cases} z^2, & \qquad\text{if $\abs{z} < \epsilon$}\\ 2\epsilon \abs{z} - \epsilon^2, & \qquad\text{otherwise} \end{cases} \]
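
This loss is implemented by scikit-learn's HuberRegressor; a minimal sketch on synthetic data with a few gross outliers (all values made up):

import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 2 * X.ravel() + 1 + rng.normal(scale=0.3, size=100)
y[:5] += 20                                      # a handful of gross outliers

huber = HuberRegressor(epsilon=1.35).fit(X, y)   # epsilon sets the quadratic region
ols = LinearRegression().fit(X, y)
print(huber.intercept_, ols.intercept_)          # Huber stays near the true 1; OLS is pulled up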

RANSAC

ransac_algorithm.png

RANSAC

ransac_algorithm2.png
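
scikit-learn provides this as RANSACRegressor (by default it wraps an ordinary linear regression); a minimal sketch on the same kind of contaminated data as above:

import numpy as np
from sklearn.linear_model import RANSACRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 2 * X.ravel() + 1 + rng.normal(scale=0.3, size=100)
y[:10] += 20                              # gross outliers

ransac = RANSACRegressor(random_state=0).fit(X, y)
print(ransac.estimator_.coef_)            # fitted on the inlier consensus set
print(ransac.inlier_mask_.sum())          # number of points judged inliers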