The gradient is a generalization of the 🧠 Derivative to functions of several variables. We find the gradient by varying one variable at a time, keeping the others constant, which yields the partial derivatives.

Partial Derivatives

For a function $f : \mathbb{R}^n \to \mathbb{R}$ of $n$ variables $x_1, \dots, x_n$, the partial derivative with respect to $x_i$ is defined as

$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \dots, x_i + h, \dots, x_n) - f(x_1, \dots, x_n)}{h}.$$

Computing all partial derivatives and collecting them in a (row) vector, we get the gradient

$$\nabla_x f = \frac{\mathrm{d}f}{\mathrm{d}x} = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} & \cdots & \dfrac{\partial f}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{1 \times n}.$$
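
As a quick sanity check (a minimal sketch assuming JAX is available; the function f below is an arbitrary example, not from the text), the gradient obtained by automatic differentiation matches the vector of hand-computed partial derivatives:

```python
import jax
import jax.numpy as jnp

# Arbitrary scalar-valued example function f : R^2 -> R.
def f(x):
    return x[0] ** 2 * x[1] + x[1] ** 3

x = jnp.array([1.0, 2.0])

# Gradient via automatic differentiation.
auto_grad = jax.grad(f)(x)

# Partial derivatives computed by hand and collected into a vector.
manual_grad = jnp.array([2 * x[0] * x[1], x[0] ** 2 + 3 * x[1] ** 2])

print(auto_grad, manual_grad)  # both approximately [4., 13.]
```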

Partial Differentiation Rules

The rules from univariate differentiation still apply, but order matters since we are dealing with matrices and vectors. The following list states them more concretely in terms of partial derivatives.

  1. Product rule: $\dfrac{\partial}{\partial x}\bigl(f(x)\,g(x)\bigr) = \dfrac{\partial f}{\partial x}\,g(x) + f(x)\,\dfrac{\partial g}{\partial x}$.
  2. Sum rule: $\dfrac{\partial}{\partial x}\bigl(f(x) + g(x)\bigr) = \dfrac{\partial f}{\partial x} + \dfrac{\partial g}{\partial x}$.
  3. Chain rule: $\dfrac{\partial}{\partial x}(g \circ f)(x) = \dfrac{\partial}{\partial x}\bigl(g(f(x))\bigr) = \dfrac{\partial g}{\partial f}\,\dfrac{\partial f}{\partial x}$.
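
As a quick check (a minimal sketch assuming JAX; the sine and squaring functions are arbitrary example choices), the product rule above can be verified with automatic differentiation:

```python
import jax
import jax.numpy as jnp

# Two arbitrary scalar functions of a scalar input.
f = lambda x: jnp.sin(x)
g = lambda x: x ** 2

x = 1.3

# Left-hand side: differentiate the product directly.
lhs = jax.grad(lambda x: f(x) * g(x))(x)

# Right-hand side: (df/dx) g(x) + f(x) (dg/dx).
rhs = jax.grad(f)(x) * g(x) + f(x) * jax.grad(g)(x)

print(jnp.allclose(lhs, rhs))  # True
```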

Info

Note that if we compute gradients as row vectors, the chain rule for a composition of multivariate functions can be carried out as a matrix multiplication of the individual Jacobians.
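
For example (a minimal sketch; g and f below are arbitrary functions $\mathbb{R}^2 \to \mathbb{R}^3$ and $\mathbb{R}^3 \to \mathbb{R}^2$ chosen only for illustration), the Jacobian of the composition equals the matrix product of the individual Jacobians:

```python
import jax
import jax.numpy as jnp

# g : R^2 -> R^3 and f : R^3 -> R^2 (arbitrary example functions).
def g(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[0]), x[1] ** 2])

def f(y):
    return jnp.array([y[0] + y[1], y[1] * y[2]])

x = jnp.array([0.5, 2.0])

# Jacobian of the composition f(g(x)) ...
J_composed = jax.jacfwd(lambda x: f(g(x)))(x)      # shape (2, 2)

# ... equals the product of the Jacobians, evaluated at the right points.
J_chain = jax.jacfwd(f)(g(x)) @ jax.jacfwd(g)(x)   # (2, 3) @ (3, 2)

print(jnp.allclose(J_composed, J_chain))  # True
```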

Vector Gradients

Functions can also output vectors. For a function $f : \mathbb{R}^n \to \mathbb{R}^m$ and a vector $x = [x_1, \dots, x_n]^\top \in \mathbb{R}^n$, we have

$$f(x) = \begin{bmatrix} f_1(x) \\ \vdots \\ f_m(x) \end{bmatrix} \in \mathbb{R}^m.$$

Taking the partial derivative of $f$ with respect to $x_i$, we have

$$\frac{\partial f}{\partial x_i} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_i} \\ \vdots \\ \dfrac{\partial f_m}{\partial x_i} \end{bmatrix} \in \mathbb{R}^m.$$

Each partial derivative is a column vector, and since the gradient stacks the partial derivatives in a row, we find the gradient of $f$ to be

$$\frac{\mathrm{d}f}{\mathrm{d}x} = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} & \cdots & \dfrac{\partial f}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n}.$$

This $m \times n$ matrix is also called the Jacobian, $J = \nabla_x f = \dfrac{\mathrm{d}f(x)}{\mathrm{d}x}$, with entries $J_{ij} = \dfrac{\partial f_i}{\partial x_j}$.
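
As an illustration (a minimal sketch; the function below is an arbitrary $f : \mathbb{R}^2 \to \mathbb{R}^3$), jax.jacobian returns exactly this $m \times n$ matrix, with rows indexing the outputs and columns the inputs:

```python
import jax
import jax.numpy as jnp

# Arbitrary example f : R^2 -> R^3, so the Jacobian is 3 x 2.
def f(x):
    return jnp.array([x[0] ** 2, x[0] * x[1], jnp.exp(x[1])])

x = jnp.array([1.0, 0.5])

J = jax.jacobian(f)(x)
print(J.shape)  # (3, 2): rows index the outputs f_i, columns the inputs x_j
print(J)
```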

Info

Note that if we have a function $f : \mathbb{R}^n \to \mathbb{R}^n$, the Jacobian locally approximates the coordinate transformation $x \mapsto f(x)$ by a linear map. The absolute value of the determinant of $J$ gives the local change in area or volume.
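
A classic example (a minimal sketch, not taken from the text above): for the polar-to-Cartesian map $(r, \theta) \mapsto (r\cos\theta, r\sin\theta)$, the determinant of the Jacobian is $r$, the familiar scaling factor in the area element $r\,\mathrm{d}r\,\mathrm{d}\theta$:

```python
import jax
import jax.numpy as jnp

# Polar -> Cartesian coordinate transformation, R^2 -> R^2.
def polar_to_cartesian(p):
    r, theta = p
    return jnp.array([r * jnp.cos(theta), r * jnp.sin(theta)])

p = jnp.array([2.0, 0.7])              # r = 2, theta = 0.7
J = jax.jacobian(polar_to_cartesian)(p)

print(jnp.linalg.det(J))               # ≈ 2.0 = r, the local area scaling
```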

Matrix Gradients

We can compute gradients in higher dimensions as well. For example, the gradient of a matrix $A \in \mathbb{R}^{m \times n}$ with respect to a matrix $B \in \mathbb{R}^{p \times q}$ is a four-dimensional tensor with shape $(m \times n) \times (p \times q)$, whose entries are $\dfrac{\partial A_{ij}}{\partial B_{kl}}$.

However, we can take advantage of the fact that there is an isomorphism between the matrix space $\mathbb{R}^{m \times n}$ and the vector space $\mathbb{R}^{mn}$: by flattening the matrices into vectors, we can compute the Jacobian just like above.
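
To make this concrete (a minimal sketch; the linear map F(B) = AB and the particular matrices are arbitrary choices), flattening the input and output with reshape lets us compare the four-dimensional gradient tensor with an ordinary Jacobian matrix:

```python
import jax
import jax.numpy as jnp

A = jnp.array([[1.0, 2.0], [3.0, 4.0]])               # fixed 2x2 matrix
B = jnp.array([[0.5, 1.0, 1.5], [2.0, 2.5, 3.0]])     # 2x3 input matrix

# F : R^{2x3} -> R^{2x3}, so dF/dB is a (2x3)x(2x3) tensor ...
F = lambda B: A @ B
J_tensor = jax.jacobian(F)(B)                         # shape (2, 3, 2, 3)

# ... but flattening input and output gives an ordinary 6x6 Jacobian.
F_vec = lambda b: (A @ b.reshape(2, 3)).reshape(-1)
J_matrix = jax.jacobian(F_vec)(B.reshape(-1))         # shape (6, 6)

print(jnp.allclose(J_tensor.reshape(6, 6), J_matrix))  # True
```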

The following are some useful gradients used in machine learning (written with the row-vector convention from above):

$$\frac{\partial}{\partial x}\, a^\top x = a^\top, \qquad \frac{\partial}{\partial x}\, x^\top B x = x^\top (B + B^\top), \qquad \frac{\partial}{\partial x}\, A x = A.$$
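
These identities are easy to verify numerically (a minimal sketch with arbitrary random test data; note that jax.grad returns the gradient as a flat array, so the row-vector convention does not show up in the shapes):

```python
import jax
import jax.numpy as jnp

# Arbitrary random test data.
a = jax.random.normal(jax.random.PRNGKey(0), (4,))
B = jax.random.normal(jax.random.PRNGKey(1), (4, 4))
x = jax.random.normal(jax.random.PRNGKey(2), (4,))

# d(a^T x)/dx = a^T
print(jnp.allclose(jax.grad(lambda x: a @ x)(x), a))                    # True

# d(x^T B x)/dx = x^T (B + B^T)
print(jnp.allclose(jax.grad(lambda x: x @ B @ x)(x), x @ (B + B.T)))    # True
```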

Second-Order Derivatives

Derivatives can be applied one after another. For example, for a function $f(x, y)$ of two variables,

$$\frac{\partial^2 f}{\partial x^2} = \frac{\partial}{\partial x}\!\left(\frac{\partial f}{\partial x}\right), \qquad \frac{\partial^2 f}{\partial y \, \partial x} = \frac{\partial}{\partial y}\!\left(\frac{\partial f}{\partial x}\right),$$

where the latter is obtained by first differentiating with respect to $x$ and then with respect to $y$.

For a twice continuously differentiable function (Schwarz's theorem),

$$\frac{\partial^2 f}{\partial x \, \partial y} = \frac{\partial^2 f}{\partial y \, \partial x}.$$

In other words, the order of differentiation doesnโ€™t matter.

For a function $f : \mathbb{R}^n \to \mathbb{R}$, we can compute the Hessian

$$H = \nabla_x^2 f = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \, \partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \, \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix} \in \mathbb{R}^{n \times n},$$

which is symmetric by the result above.

This matrix measures the curvature of the function around $x$.
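
For example (a minimal sketch with an arbitrary twice-differentiable function), the Hessian can be computed with jax.hessian, and its symmetry reflects the equality of mixed partials noted above:

```python
import jax
import jax.numpy as jnp

# Arbitrary example f : R^2 -> R.
def f(x):
    return x[0] ** 3 * x[1] + jnp.cos(x[1])

x = jnp.array([1.0, 0.5])

H = jax.hessian(f)(x)
print(H.shape)               # (2, 2)
print(jnp.allclose(H, H.T))  # True: mixed partials agree
```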