Theory

Assuming that our binary-labeled data has a linear split, we can classify data points by dividing them with a hyperplane defined by

$$w^\top x + b = 0.$$

Points on one side of the hyperplane belong to one class, and those on the other side belong to the other class.

There are many different hyperplanes that can divide up the data, but the best one maximizes the margin, the minimum distance from a point to the hyperplane. The data points that lie on the margin are called support vectors.

Info

Technically, support vectors are the points that aren't strictly past the margin on the correct side of the hyperplane. Misclassified points are also considered support vectors.
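As a quick numerical illustration (not from the notes): the distance from a point $x$ to the hyperplane is $\frac{|w^\top x + b|}{\lVert w \rVert}$, so the margin of a candidate hyperplane can be checked directly. The hyperplane and points below are made up for the example.

```python
import numpy as np

# Hypothetical hyperplane w^T x + b = 0 and a few 2D points (illustrative values only).
w = np.array([2.0, 1.0])
b = -1.0
X = np.array([[1.0, 1.0], [0.0, 0.0], [2.0, -1.0]])

# Signed distance of each point to the hyperplane: the sign tells us which side
# the point is on, and the smallest absolute value is this hyperplane's margin.
signed_dist = (X @ w + b) / np.linalg.norm(w)
print(signed_dist)                 # approximately [ 0.894 -0.447  0.894]
print(np.abs(signed_dist).min())   # margin of this particular hyperplane
```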

Since the overall scale of $w$ and $b$ doesn't affect the hyperplane, SVM fixes this scale with the constraint that, for every training point $(x_i, y_i)$,

$$y_i(w^\top x_i + b) \geq 1.$$

Note that this inequality turns into an equality for support vectors.

The size of the margin is then solely dependent on the hyperplane and can be calculated to be $\frac{1}{\lVert w \rVert}$. To maximize this value (equivalently, to minimize $\frac{1}{2}\lVert w \rVert^2$), our objective, known as the separable primal problem, is

$$\min_{w, b} \ \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i(w^\top x_i + b) \geq 1 \ \text{for all } i.$$
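As a sketch of what solving the separable primal looks like in practice, the problem can be handed directly to a convex solver. This uses cvxpy and a made-up, linearly separable toy dataset; it is an illustration, not the method from class.

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

w = cp.Variable(d)
b = cp.Variable()

# Separable primal: minimize 1/2 ||w||^2  subject to  y_i (w^T x_i + b) >= 1.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
```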

Our constraints satisfy the strong duality conditions, so we can use 👠 Constrained Optimization to solve for the optimal $w$ and $b$. Applying the augmented Lagrangian and solving for the Lagrange dual, we can transform this into the separable dual form

$$\max_{\alpha} \ \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, x_i^\top x_j \quad \text{subject to} \quad \alpha_i \geq 0, \quad \sum_i \alpha_i y_i = 0,$$

where weights $w = \sum_i \alpha_i y_i x_i$ and bias $b = y_k - w^\top x_k$ for any support vector $x_k$.

Note

By the KKT complementary slackness condition, one of $\alpha_i$ or $y_i(w^\top x_i + b) - 1$ must be $0$. Therefore, the data points with nonzero $\alpha_i$ are support vectors.
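Assuming the dual has been solved by some QP solver (`alpha`, `X`, and `y` below stand in for its output and the training data), recovering the hyperplane and identifying the support vectors is a short numpy sketch:

```python
import numpy as np

def recover_hyperplane(alpha, X, y, tol=1e-8):
    """Recover (w, b) from dual variables alpha, assumed already solved."""
    # w = sum_i alpha_i y_i x_i
    w = (alpha * y) @ X
    # By complementary slackness, points with nonzero alpha_i are the support vectors.
    sv = alpha > tol
    # For a support vector x_k, y_k (w^T x_k + b) = 1, so b = y_k - w^T x_k;
    # averaging over all support vectors is a bit more numerically stable.
    b = np.mean(y[sv] - X[sv] @ w)
    return w, b
```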

This equation tells us that the hyperplane is solely dependent on the similarity between pairs $x_i$ and $x_j$. Currently, this similarity is measured by the dot product $x_i^\top x_j$, but we can instead replace it with a 🍿 Kernel $K(x_i, x_j)$ to achieve the kernelized separable dual

$$\max_{\alpha} \ \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, K(x_i, x_j) \quad \text{subject to} \quad \alpha_i \geq 0, \quad \sum_i \alpha_i y_i = 0$$

that can separate linearly inseparable data.
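To sketch how the kernelized dual is actually used, the decision function $f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$ only ever needs kernel evaluations against the training points. The Gaussian/RBF kernel below is one common choice, and `alpha` and `b` are assumed to come from a dual solver; none of this is prescribed by the notes.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2) for every pair of rows of A and B."""
    sq_dists = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq_dists)

def decision_function(X_new, X_train, y_train, alpha, b, gamma=1.0):
    """f(x) = sum_i alpha_i y_i K(x_i, x) + b; the predicted class is its sign."""
    K = rbf_kernel(X_train, X_new, gamma)      # shape (n_train, n_new)
    return (alpha * y_train) @ K + b
```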

The above formulations will not converge if there's no perfect separation. If we don't need to perfectly separate the data, the non-separable SVM instead introduces the slack variable

$$\xi_i = \max\bigl(0, \ 1 - y_i(w^\top x_i + b)\bigr),$$

which we can interpret as the amount that $x_i$ crosses into the margin, or the penalty on a low-confidence classification.

Adding it to the primal problem, we get a softer constraint

$$\min_{w, b, \xi} \ \frac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i(w^\top x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0.$$

This form can be interpreted as minimizing the hinge loss, defined as $\max\bigl(0, \ 1 - y_i(w^\top x_i + b)\bigr)$, while applying $L_2$ regularization to the weights $w$. $C$ controls the strength of the constraint, or the degree of regularization.
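A minimal numpy sketch of this interpretation, writing the soft-margin objective as hinge loss plus $L_2$ regularization (the function and argument names are illustrative):

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """C * sum_i max(0, 1 - y_i (w^T x_i + b)) + 1/2 ||w||^2.

    The first term is the total hinge loss (equivalently, the sum of the slacks xi_i);
    the second term is L2 regularization on the weights w.
    """
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return C * hinge.sum() + 0.5 * np.dot(w, w)
```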

This form of the primal can be generalized with different 📌 Norms, so that the regularization term is $\lVert w \rVert_p$ and the slack penalty is $\lVert \xi \rVert_q$ (for example, the $L_1$-regularized SVM or the squared hinge loss).

In the non-separable dual, we have an additional upper bound on $\alpha_i$,

$$0 \leq \alpha_i \leq C.$$
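To show where the box constraint $0 \leq \alpha_i \leq C$ appears in an actual solver, here is a hedged sketch of dual coordinate descent for the linear soft-margin SVM. For simplicity it drops the bias term, and with it the equality constraint $\sum_i \alpha_i y_i = 0$, so it is a common simplification rather than the exact dual above.

```python
import numpy as np

def dual_coordinate_descent(X, y, C, epochs=50):
    """Maximize  sum_i a_i - 1/2 ||sum_i a_i y_i x_i||^2  subject to  0 <= a_i <= C,
    updating one alpha_i at a time and clipping it back into the box."""
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                      # w = sum_i alpha_i y_i x_i, kept in sync with alpha
    sq_norms = (X**2).sum(axis=1)        # Q_ii = x_i^T x_i
    for _ in range(epochs):
        for i in range(n):
            grad = y[i] * (X[i] @ w) - 1             # partial derivative of the negated dual
            old = alpha[i]
            alpha[i] = np.clip(old - grad / sq_norms[i], 0.0, C)
            w += (alpha[i] - old) * y[i] * X[i]      # incremental update of w
    return alpha, w
```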

Model

The model itself consists of weights $w$ and bias $b$, which represent the hyperplane used to split the space.

Training

The training method was not covered in class, but algorithms like sequential minimal optimization (SMO) and the sub-gradient method work well.
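As one concrete option, here is a hedged sketch of the sub-gradient method on the regularized hinge loss, roughly in the style of Pegasos; the regularization strength `lam`, the step-size schedule, and the unregularized bias are illustrative choices rather than anything specified in class.

```python
import numpy as np

def train_svm_subgradient(X, y, lam=0.01, epochs=100, seed=0):
    """Stochastic sub-gradient descent on  lam/2 ||w||^2 + (1/n) sum_i max(0, 1 - y_i (w^T x_i + b))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)                # decaying step size
            # The hinge term contributes a sub-gradient only when the margin is violated.
            if y[i] * (X[i] @ w + b) < 1:
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
                b += eta * y[i]
            else:
                w = (1 - eta * lam) * w
    return w, b
```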

Prediction

Given input $x$, its class is simply $\operatorname{sign}(w^\top x + b)$, the side of the hyperplane it's on.
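In code, prediction is a one-liner (assuming `w` and `b` come from one of the training sketches above):

```python
import numpy as np

def predict(X, w, b):
    """Classify each row of X by which side of the hyperplane w^T x + b = 0 it falls on."""
    return np.where(X @ w + b >= 0, 1, -1)
```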