Regularization penalties force weights to be smaller, preventing over-reliance on particular features in the training data and thereby reducing overfitting.
Penalties commonly use 📌 Norms on the weights scaled by a strength coefficient $\lambda$, adding $\lambda \|\mathbf{w}\|$ to the loss function.
- Ridge regression uses the squared $L_2$ norm, $\lambda \|\mathbf{w}\|_2^2$, which encourages all weights to be smaller and shrinks the largest weights the most. This is equivalent to applying MAP with a Gaussian prior in 🏦 Linear Regression.
- Lasso regression uses the $L_1$ norm, $\lambda \|\mathbf{w}\|_1$, which shrinks all weights evenly and drives some to $0$, performing feature selection. There is no closed-form solution, so optimization requires ⛰️ Gradient Descent.
- With the $L_0$ norm, $\lambda \|\mathbf{w}\|_0$, we get a penalty that only cares about how many weights are nonzero, again performing feature selection; since this penalty is not differentiable, it is optimized with 🔎 Greedy Search. A sketch of all three penalties follows this list.
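A minimal sketch of how the three penalties score the same weight vector; the toy weights, the $\lambda$ value, and the `ridge_loss` helper are illustrative assumptions, not from the source:

```python
import numpy as np

# Toy weight vector; illustrative values only.
w = np.array([3.0, -0.5, 0.0, 1.2])
lam = 0.1  # regularization strength coefficient

l2_penalty = lam * np.sum(w ** 2)       # Ridge: shrinks the largest weights the most
l1_penalty = lam * np.sum(np.abs(w))    # Lasso: even shrinkage, drives some weights to 0
l0_penalty = lam * np.count_nonzero(w)  # L0: only counts how many weights are nonzero

# Regularized objective = data loss + scaled penalty, shown here for Ridge.
def ridge_loss(w, X, y, lam):
    residual = X @ w - y
    return np.mean(residual ** 2) + lam * np.sum(w ** 2)
```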
The following is an example of how a penalty modifies the objective, here with Ridge's $L_2$ term added to a squared-error loss:

$$\mathcal{L}(\mathbf{w}) = \sum_{i} \left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2 + \lambda \|\mathbf{w}\|_2^2$$
Elastic-net uses both the $L_1$ and $L_2$ penalties, $\lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2$, combining Lasso's feature selection with Ridge's smooth shrinkage.
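A quick sketch of Elastic-net in practice via scikit-learn's `ElasticNet`; the synthetic data and the `alpha`/`l1_ratio` values are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_w = np.zeros(10)
true_w[:3] = [2.0, -1.5, 0.7]  # only the first 3 features are informative
y = X @ true_w + 0.1 * rng.normal(size=100)

# l1_ratio mixes the penalties: 1.0 is pure Lasso, 0.0 is pure Ridge.
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X, y)
print(model.coef_)  # the L1 part zeroes out uninformative weights; the L2 part keeps shrinkage smooth
```

With `l1_ratio=0.5` the two penalties are evenly mixed, so the fitted coefficients for the uninformative features typically land at exactly $0$ while the rest are shrunk smoothly.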
The following is a visual example of the difference between Lasso, Ridge, and Elastic-net. The rings represent contours of the loss function, and the colored shapes are contours of the penalty; the point where a loss contour first touches the penalty contour is the optimal parameter setting.
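A comparable picture can be generated with matplotlib; the quadratic loss, its minimum at $(1.2, 0.8)$, and the unit penalty level shown are all arbitrary illustrative choices:

```python
import numpy as np
import matplotlib.pyplot as plt

w1, w2 = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200))

# Quadratic loss centered at an arbitrary unregularized optimum (1.2, 0.8).
loss = 2.0 * (w1 - 1.2) ** 2 + 0.5 * (w2 - 0.8) ** 2

# Penalty contours: Lasso (diamond), Ridge (circle), Elastic-net (in between).
lasso = np.abs(w1) + np.abs(w2)
ridge = w1 ** 2 + w2 ** 2
enet = 0.5 * lasso + 0.5 * ridge

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, pen, title in zip(axes, [lasso, ridge, enet], ["Lasso", "Ridge", "Elastic-net"]):
    ax.contour(w1, w2, loss, levels=6, colors="gray")          # rings: loss contours
    ax.contour(w1, w2, pen, levels=[1.0], colors="tab:blue")   # shape: penalty contour
    ax.set_title(title)
plt.show()
```

The diamond's corners sit on the axes, which is why the Lasso solution tends to land on points where some weights are exactly $0$.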