Theory

Assume our input data is linear; that is, it follows the function

$$y = w^\top x + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

Noise $\epsilon$ can be interpreted as randomness or as the effect of other features not included in $x$. Another way to write $y$ is as follows.

$$p(y \mid x, w) = \mathcal{N}\big(y \mid w^\top x, \sigma^2\big)$$

Linear regression fits a linear model that maximizes the fit, which equates to minimizing the error $E(w)$ or maximizing the likelihood $p(\mathbf{y} \mid X, w)$, both of which are defined below.

$$E(w) = \sum_{i=1}^{n} \big(y_i - w^\top x_i\big)^2$$

$$p(\mathbf{y} \mid X, w) = \prod_{i=1}^{n} \mathcal{N}\big(y_i \mid w^\top x_i, \sigma^2\big)$$

To maximize the likelihood, we find the $w$ that maximizes the log-likelihood, which simplifies the math. This gives us an objective that is analogous to minimizing $E(w)$.

$$\log p(\mathbf{y} \mid X, w) = -\frac{n}{2}\log\big(2\pi\sigma^2\big) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\big(y_i - w^\top x_i\big)^2$$

Since the first term does not depend on $w$, maximizing the log-likelihood over $w$ is the same as minimizing $E(w)$.
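As a quick numerical check of this equivalence, the following sketch (assuming NumPy; the synthetic data, the grid of candidate weights, and names like `sse` and `neg_log_lik` are illustrative) evaluates both objectives for a single-feature model and confirms they share the same minimizer.

```python
import numpy as np

# Synthetic single-feature data from y = 2.0 * x + noise (illustrative values).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
sigma = 0.5
y = 2.0 * x + rng.normal(0, sigma, size=100)

# Candidate weights to evaluate both objectives on.
ws = np.linspace(0.0, 4.0, 401)

# Sum-of-squares error E(w) for each candidate w.
sse = np.array([np.sum((y - w * x) ** 2) for w in ws])

# Negative log-likelihood under Gaussian noise with known sigma.
n = len(x)
neg_log_lik = 0.5 * n * np.log(2 * np.pi * sigma**2) + sse / (2 * sigma**2)

# The negative log-likelihood is an increasing affine transform of the
# squared error, so both are minimized by the same w.
assert ws[np.argmin(sse)] == ws[np.argmin(neg_log_lik)]
print(ws[np.argmin(sse)])  # close to the true slope of 2.0
```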

An example of performing (MLE) linear regression on a single feature $x$ is sketched below. Note that with regularization (MAP), we need to incorporate an extra penalty term into our error:

$$E_{\text{MAP}}(w) = \sum_{i=1}^{n}\big(y_i - w^\top x_i\big)^2 + \lambda \lVert w \rVert^2$$

The penalty term pushes the weights toward $0$. This makes MAP not scale-invariant, since the scale of the weights now matters; MLE, on the other hand, is scale-invariant and will always find the parameters that maximize the likelihood.
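A minimal sketch of that single-feature example, assuming NumPy and the closed-form solutions given in the Training section below; the data, the regularization strength `lam`, and the helpers `fit_mle` / `fit_map` are illustrative. It also demonstrates the scale-sensitivity point: rescaling the feature leaves the MLE fit (after undoing the rescaling) unchanged, but changes the MAP fit.

```python
import numpy as np

# Synthetic single-feature data (illustrative): y = 3.0 * x + Gaussian noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=50)
y = 3.0 * x + rng.normal(0, 0.3, size=50)

X = x.reshape(-1, 1)   # design matrix with one feature
lam = 5.0              # made-up regularization strength

def fit_mle(X, y):
    # Ordinary least squares: w = (X^T X)^{-1} X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)

def fit_map(X, y, lam):
    # MAP with a zero-mean Gaussian prior (ridge): w = (X^T X + lam I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_mle = fit_mle(X, y)
w_map = fit_map(X, y, lam)
print(w_mle, w_map)    # the MAP weight is shrunk toward 0

# Scale sensitivity: rescale the feature by 100.
X_scaled = 100 * X
# MLE adapts exactly (the weight shrinks by 100, predictions are unchanged)...
print(fit_mle(X_scaled, y) * 100)        # matches w_mle
# ...while MAP does not, because the penalty now acts on the rescaled weight.
print(fit_map(X_scaled, y, lam) * 100)   # differs from w_map
```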

Model

Our model consists of the weight vector $w$ (which can be MLE or MAP). With MAP, we apply a prior on $w$ (usually a zero-mean Gaussian, $\mathcal{N}(0, b^2 I)$), which causes a regularization effect.

Training

Given training data $X \in \mathbb{R}^{n \times d}$ and $\mathbf{y} \in \mathbb{R}^{n}$, assume the weight prior $w \sim \mathcal{N}(0, b^2 I)$. Let the regularization term $\lambda = \sigma^2 / b^2$; then the closed-form solution is as follows.

$$w_{\text{MAP}} = \big(X^\top X + \lambda I\big)^{-1} X^\top \mathbf{y}$$

If we don't use a prior, our MLE closed form is as follows.

$$w_{\text{MLE}} = \big(X^\top X\big)^{-1} X^\top \mathbf{y}$$
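Both closed forms translate directly to NumPy (a sketch; `closed_form` and the data shapes are illustrative). Using `np.linalg.solve` avoids explicitly inverting $X^\top X$, which is better numerically.

```python
import numpy as np

def closed_form(X, y, lam=0.0):
    """Closed-form linear regression weights.

    lam = 0.0 gives the MLE (ordinary least squares) solution;
    lam > 0.0 gives the MAP (ridge) solution.
    """
    d = X.shape[1]
    # Solve (X^T X + lam I) w = X^T y instead of forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Illustrative data: 200 examples, 3 features.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(0, 0.1, size=200)

w_mle = closed_form(X, y)            # close to true_w
w_map = closed_form(X, y, lam=10.0)  # shrunk toward 0

# Sanity check: the MLE solution matches NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_mle, w_lstsq)
```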

To get a bias term, add a new feature that is always $1$ to every training example; the learned coefficient for that feature is our bias.
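A sketch of this trick in NumPy (the helper name `add_bias_column` and the choice to append the constant column last are illustrative conventions):

```python
import numpy as np

def add_bias_column(X):
    # Append a constant feature of 1s so the last learned weight acts as the bias.
    ones = np.ones((X.shape[0], 1))
    return np.hstack([X, ones])

# With the augmented design matrix, the same closed form learns weights and bias together:
# X_aug = add_bias_column(X)
# w_aug = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
# weights, bias = w_aug[:-1], w_aug[-1]
```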

Note that computing the closed-form solution may be expensive, in which case 🗼 Least Mean Squares provides an alternative optimization method using gradient descent.
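For comparison, a minimal gradient-descent sketch on the squared error; this is not necessarily the exact update from the linked Least Mean Squares note, and the learning rate `lr` and step count are made-up defaults.

```python
import numpy as np

def gradient_descent_fit(X, y, lr=0.01, n_steps=1000):
    # Minimize E(w) = ||y - Xw||^2 by stepping against its gradient.
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient averaged over examples
        w -= lr * grad
    return w
```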

Prediction

Our prediction for an input $x$ is $\hat{y} = w^\top x$, which returns the location of the point on the fitted line.
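With a trained weight vector (including the bias coefficient if the constant-feature trick above was used), prediction is a single matrix product; a small sketch assuming NumPy arrays:

```python
import numpy as np

def predict(X_new, w):
    # Each prediction is the dot product w^T x for a row x of X_new.
    return X_new @ w
```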