Linear Models

Regression

Linear Regression

Model:
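With $n$ input features and the convention $x_0 = 1$ for the bias term, the prediction is a linear combination of the features:

$$\hat y = h_{\boldsymbol\theta}(\mathbf x) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n = \boldsymbol\theta^T \mathbf x$$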

Cost Function: Mean Squared Error
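Averaged over the $m$ training instances $(\mathbf x^{(i)}, y^{(i)})$:

$$\mathrm{MSE}(\boldsymbol\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \boldsymbol\theta^T \mathbf x^{(i)} - y^{(i)} \right)^2$$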

Training:

  • The Normal Equation: $\hat{\boldsymbol\theta} = \left( \mathbf X^T \mathbf X \right)^{-1} \mathbf X^T \mathbf y$

  • Gradient Descent:

    Refer to Gradient Descent

  • The Normal Equation vs. Gradient Descent

    Normal Equation
      • Pros: no need to choose $\alpha$; no need to iterate
      • Cons: need to calculate $(\mathbf X^T \mathbf X)^{-1}$; slow if $n$ is very large

    Gradient Descent
      • Pros: works well even when $n$ is large
      • Cons: need to choose $\alpha$; need to iterate

Code

Linear Regression with Normal Equation:

from sklearn.linear_model import LinearRegression


lin_reg = LinearRegression()
lin_reg.fit(X_train, Y_train)
lin_reg.predict(X_new)
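
For comparison, the Normal Equation can also be evaluated directly with NumPy (a minimal sketch, assuming X_train and Y_train are NumPy arrays; X_b is just X_train with a bias column of 1s prepended):

import numpy as np


X_b = np.c_[np.ones((len(X_train), 1)), X_train]  # add x0 = 1 to every instance
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ Y_train  # the Normal Equation
y_pred = np.c_[np.ones((len(X_new), 1)), X_new] @ theta_best  # predict on new data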

Linear Regression with Gradient Descent:

from sklearn.linear_model import SGDRegressor


# penalty=None disables regularization; eta0 sets the initial learning rate
sgd_reg = SGDRegressor(max_iter=n_epochs, penalty=None, eta0=learning_rate)
sgd_reg.fit(X_train, Y_train)

Ridge Regression

Add L2 regularization to Linear Regression's Cost function.

Model:

the same as linear regression

Cost Function:
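The MSE plus an L2 penalty on the weights (one standard form; the bias term $\theta_0$ is not regularized):

$$J(\boldsymbol\theta) = \mathrm{MSE}(\boldsymbol\theta) + \alpha \frac{1}{2} \sum_{i=1}^{n} \theta_i^2$$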

Training:

  • The Normal Equation (Cholesky): $\hat{\boldsymbol\theta} = \left( \mathbf X^T \mathbf X + \alpha \mathbf A \right)^{-1} \mathbf X^T \mathbf y$

    where $\mathbf A$ is the $(n + 1) \times (n + 1)$ identity matrix, except with a 0 in the top-left cell so that the bias term $\theta_0$ is not regularized

  • Gradient Descent:

    Refer to Gradient Descent

Code

Ridge Regression with Cholesky Equation:

from sklearn.linear_model import Ridge


ridge_reg = Ridge(alpha=1, solver='cholesky')
ridge_reg.fit(X_train, Y_train)
ridge_reg.predict(X_new)

Ridge Regression with Gradient Descent:

from sklearn.linear_model import SGDRegressor


sgd_reg = SGDRegressor(penalty="l2")
sgd_reg.fit(X_train, Y_train)

Lasso Regression

Add L1 regularization to Linear Regression's Cost function.

Model:

the same as linear regression

Cost Function:
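The MSE plus an L1 penalty on the weights:

$$J(\boldsymbol\theta) = \mathrm{MSE}(\boldsymbol\theta) + \alpha \sum_{i=1}^{n} \left| \theta_i \right|$$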

Training:

The Lasso cost function is not differentiable at $\theta_i = 0$, but Gradient Descent still works fine if a subgradient vector is used wherever $\theta_i = 0$.
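
A common choice of subgradient vector (treating the derivative of $|\theta_i|$ as $0$ when $\theta_i = 0$):

$$g(\boldsymbol\theta) = \nabla_{\boldsymbol\theta}\, \mathrm{MSE}(\boldsymbol\theta) + \alpha \begin{pmatrix} \operatorname{sign}(\theta_1) \\ \vdots \\ \operatorname{sign}(\theta_n) \end{pmatrix}$$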

Code

Lasso Regression:

from sklearn.linear_model import Lasso


lasso_reg = Lasso()
lasso_reg.fit(X_train, Y_train)
lasso_reg.predict(X_new)

Lasso Regression with Gradient Descent:

from sklearn.linear_model import SGDRegressor


sgd_reg = SGDRegressor(penalty='l1')
sgd_reg.fit(X_train, Y_train)
sgd_reg.predict(X_new)

Elastic Net

Add a mix of L1 and L2 regularization to Linear Regression's cost function.

Model:

the same as linear regression

Cost function
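With mix ratio $r$ (pure Lasso when $r = 1$, pure Ridge when $r = 0$):

$$J(\boldsymbol\theta) = \mathrm{MSE}(\boldsymbol\theta) + r \alpha \sum_{i=1}^{n} \left| \theta_i \right| + \frac{1 - r}{2} \alpha \sum_{i=1}^{n} \theta_i^2$$

In the scikit-learn code below, the l1_ratio parameter plays the role of $r$.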

Training

Code

from sklearn.linear_model import ElasticNet


elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X_train, Y_train)
elastic_net.predict(X_new)

Comparison Between L1 Regularization and L2 Regularization

L1 Regularization
  • Solution uniqueness: No
  • Sparsity: Yes
  • Feature selection: Yes
  • Computational efficiency: Low (no analytical solution)

L2 Regularization
  • Solution uniqueness: Yes
  • Sparsity: No
  • Feature selection: No
  • Computational efficiency: High

How to Choose Between Regression Models

  • It is generally preferable to have at least a little regularization; thus, avoid plain (unregularized) Linear Regression models.
  • Ridge Regression is a good default.
  • If you suspect only a few features are actually useful, use either Lasso Regression or Elastic Net.
  • Generally speaking, Elastic Net is preferred over Lasso, since Lasso may behave erratically (1) when the number of features ($n$) is greater than the number of training instances ($m$), or (2) when several features are strongly correlated.

Classification

Logistic Regression (Binary Classes)

Model

Estimate Probabilities:
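The estimated probability of the positive class is a linear score passed through the sigmoid function $\sigma$:

$$\hat p = h_{\boldsymbol\theta}(\mathbf x) = \sigma(\boldsymbol\theta^T \mathbf x), \qquad \sigma(t) = \frac{1}{1 + e^{-t}}$$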

Prediction:
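Using the default threshold of $0.5$:

$$\hat y = \begin{cases} 0 & \text{if } \hat p < 0.5 \\ 1 & \text{if } \hat p \ge 0.5 \end{cases}$$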

Cost function
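The log loss, averaged over the $m$ training instances:

$$J(\boldsymbol\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(\hat p^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - \hat p^{(i)}\right) \right]$$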

Training

The cost function above is convex, so Gradient Descent is guaranteed to find the global minimum (given a suitable learning rate and enough iterations).

Code

from sklearn.linear_model import LogisticRegression


log_reg = LogisticRegression()
log_reg.fit(X_train, Y_train)
log_reg.predict(X_new)

Logistic Regression (Multiple Classes)

Model

Compute score for each class:
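For each class $k$, the score is a linear function of the input:

$$s_k(\mathbf x) = \boldsymbol\theta_k^T \mathbf x$$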

Note: each class $k$ has its own parameter vector $\boldsymbol\theta_k$.

Normalize score with Softmax function:
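With $K$ classes:

$$\hat p_k = \frac{\exp\left(s_k(\mathbf x)\right)}{\sum_{j=1}^{K} \exp\left(s_j(\mathbf x)\right)}$$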

Prediction:
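Predict the class with the highest estimated probability, which is also the class with the highest score:

$$\hat y = \underset{k}{\operatorname{argmax}}\; \hat p_k = \underset{k}{\operatorname{argmax}}\; s_k(\mathbf x)$$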

Cost Function
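The cross entropy, where $y_k^{(i)}$ is $1$ if the target class of instance $i$ is $k$ and $0$ otherwise:

$$J(\boldsymbol\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log\left(\hat p_k^{(i)}\right)$$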

Training
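The gradient of the cross entropy with respect to each class's parameter vector has the same form as in the binary case, so Gradient Descent applies directly:

$$\nabla_{\boldsymbol\theta_k} J(\boldsymbol\Theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat p_k^{(i)} - y_k^{(i)} \right) \mathbf x^{(i)}$$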

Code

from sklearn.linear_model import LogisticRegression


# C is the inverse of the regularization strength
multi_log_reg = LogisticRegression(multi_class='multinomial', solver='sag', C=10)
multi_log_reg.fit(X_train, Y_train)
multi_log_reg.predict(X_new)