Gradient descent is a method for optimizing model weights that is commonly used when there is no direct closed-form solution; online methods also fall back on gradient descent when closed-form solutions are too expensive to compute.
The observation at the core of gradient descent is that the gradient of our objective $\nabla_\theta J(\theta)$ points in the direction of steepest ascent.
Commonly, our goal in machine learning is to minimize the objective. Thus, our gradient update step subtracts the gradient:

$$\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$$

for some hyperparameter $\eta$, the learning rate (step size).
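As a concrete illustration, here is a minimal sketch of this update loop in Python with NumPy; the `gradient_descent` helper and the quadratic objective below are illustrative assumptions, not a fixed API.

```python
import numpy as np

def gradient_descent(grad, theta0, lr=0.1, n_steps=100):
    """Repeat theta <- theta - lr * grad(theta) for a fixed number of steps."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - lr * grad(theta)      # subtract the scaled gradient
    return theta

# Illustrative objective J(theta) = ||theta - 3||^2, whose gradient is 2 * (theta - 3).
grad_J = lambda theta: 2.0 * (theta - 3.0)
print(gradient_descent(grad_J, theta0=[0.0, 0.0]))  # approaches [3., 3.]
```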
An alternate interpretation of gradient descent is through the lens of 👠 Constrained Optimization, where our objective is to minimize a linearization of our objective in some neighborhood defined by our step size; the constraint controls how far we move with the gradient. Formally, our problem is

$$\min_{\Delta\theta} \; \nabla_\theta J(\theta)^\top \Delta\theta \quad \text{subject to} \quad \|\Delta\theta\|_2 \leq \epsilon,$$

where our objective is the gradient direction applied to our change in parameters; that is, if $\Delta\theta = -\eta \nabla_\theta J(\theta)$, then $\|\Delta\theta\|_2 = \eta\,\|\nabla_\theta J(\theta)\|_2$, so our constraint is indeed determined by the step size $\eta$.
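For completeness, the linearized problem above has a standard closed-form minimizer (sketched here under the same notation): the optimal step points along the negative gradient and saturates the constraint, which is exactly a gradient step with a particular learning rate,

$$\Delta\theta^{\star} = -\epsilon \, \frac{\nabla_\theta J(\theta)}{\|\nabla_\theta J(\theta)\|_2} = -\eta \, \nabla_\theta J(\theta) \quad \text{with} \quad \eta = \frac{\epsilon}{\|\nabla_\theta J(\theta)\|_2}.$$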
Update Timing
The gradient $\nabla_\theta J(\theta)$ can be computed over different amounts of data before each weight update:
- Batch gradient descent updates weights after going through the entire dataset
- Stochastic gradient descent updates weights after computing the derivative for a single datapoint, resulting in oscillations but decreasing convergence duration
- Mini-batch gradient descent is a balance between the two, updating weights after checking $n$ datapoints at a time for some batch size $n$, as sketched below
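Here is a minimal mini-batch sketch in Python with NumPy, assuming `X` and `y` are arrays and that a hypothetical `grad_on_batch(theta, X_batch, y_batch)` helper returns the gradient averaged over a batch; setting `batch_size=1` or `batch_size=len(X)` recovers stochastic and batch gradient descent, respectively.

```python
import numpy as np

def minibatch_gradient_descent(grad_on_batch, X, y, theta0,
                               lr=0.01, batch_size=32, n_epochs=10, seed=0):
    """Update the weights after each batch of `batch_size` datapoints."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    n = len(X)
    for _ in range(n_epochs):
        order = rng.permutation(n)                    # reshuffle datapoints each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]     # indices of the next batch
            theta = theta - lr * grad_on_batch(theta, X[idx], y[idx])
    return theta

# Illustrative usage: least-squares gradient of (1/n_b) * ||X_b w - y_b||^2 on each batch.
grad_ls = lambda w, Xb, yb: 2.0 * Xb.T @ (Xb @ w - yb) / len(Xb)
X = np.random.default_rng(1).normal(size=(200, 2))
y = X @ np.array([2.0, -1.0])
print(minibatch_gradient_descent(grad_ls, X, y, theta0=[0.0, 0.0], lr=0.1))  # ~[2., -1.]
```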
Momentum
Convergence with standard gradient descent may be slow if the curvature of the objective is poorly scaled, as in a long, narrow valley. Momentum is an additional term that remembers what happened in previous update steps; incorporating this into our algorithm dampens oscillations and smooths out the updates.
Mathematically, momentum $v_t$ accumulates an exponentially decaying sum of past gradients:

$$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)$$

for a momentum coefficient $\gamma$. Then, our update rule becomes

$$\theta \leftarrow \theta - v_t.$$
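A minimal sketch of this momentum update in Python with NumPy, reusing the same style of gradient function as the earlier sketch; `gamma` stands for the momentum coefficient $\gamma$.

```python
import numpy as np

def momentum_gradient_descent(grad, theta0, lr=0.1, gamma=0.9, n_steps=200):
    """Momentum update: v <- gamma * v + lr * grad(theta), then theta <- theta - v."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)                  # v_0 = 0
    for _ in range(n_steps):
        v = gamma * v + lr * grad(theta)      # exponentially decaying sum of past gradients
        theta = theta - v
    return theta

# Same illustrative quadratic as before: J(theta) = ||theta - 3||^2.
grad_J = lambda theta: 2.0 * (theta - 3.0)
print(momentum_gradient_descent(grad_J, theta0=[0.0, 0.0]))  # approaches [3., 3.]
```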