Theory
Assuming that our binary-labeled data $\{(x_i, y_i)\}$ with $y_i \in \{-1, +1\}$ is linearly separable, we can classify data points by dividing them with a hyperplane defined by

$$w^T x + b = 0.$$
Points on one side of the hyperplane belong to one class, and those on the other side belong to the other class.
There are many hyperplanes that can divide the data, but the best one maximizes the margin, the minimum distance from a data point to the hyperplane. The data points that lie on the margin are called support vectors.
Info
Technically, support vectors are the points that are not on the correct side of the hyperplane beyond the margin; misclassified points are also considered support vectors.
Since scaling $w$ and $b$ by a constant doesn't change the hyperplane, the SVM fixes the scale with the constraint

$$y_i(w^T x_i + b) \geq 1 \quad \text{for all } i.$$
Note that this inequality becomes an equality for support vectors.
The size of the margin then depends only on the hyperplane: the distance from a point $x_i$ to the hyperplane is $\frac{|w^T x_i + b|}{\|w\|}$, and since the numerator equals $1$ for support vectors, the margin is

$$\frac{1}{\|w\|}.$$

Maximizing the margin is therefore equivalent to the primal problem

$$\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1.$$
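As a concrete check of the margin formula, here is a minimal sketch (assuming NumPy; the hyperplane parameters and data points are made-up placeholders) that computes each point's distance to a hyperplane and takes the minimum as the margin.

```python
import numpy as np

# Hypothetical hyperplane parameters and toy data, for illustration only.
w = np.array([2.0, 1.0])
b = -1.0
X = np.array([[1.0, 1.0], [2.0, 3.0], [-1.0, 0.5], [0.0, -2.0]])

# Distance from each point to the hyperplane w.x + b = 0 is |w.x + b| / ||w||.
distances = np.abs(X @ w + b) / np.linalg.norm(w)

# The margin is the minimum of these distances; under the scaling constraint
# y_i (w.x_i + b) >= 1 (equality for support vectors), it equals 1 / ||w||.
margin = distances.min()
print(margin)
```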
Our constraints satisfy the strong duality conditions, so we can use Constrained Optimization (Lagrangian duality) to solve the equivalent dual problem

$$\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^T x_j \quad \text{s.t.} \quad \alpha_i \geq 0, \;\; \sum_i \alpha_i y_i = 0,$$

where the weights are recovered as

$$w = \sum_i \alpha_i y_i x_i.$$
Note
By the KKT complementary slackness condition, for each point either $\alpha_i = 0$ or the constraint is tight, i.e. $y_i(w^T x_i + b) = 1$. Therefore, the data points with nonzero $\alpha_i$ are exactly the support vectors.
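To make the dual-to-primal relationship concrete, here is a sketch (assuming NumPy; the data and the dual solution `alpha` are placeholder values, not the output of a real solver) of recovering $w$ and $b$ from the multipliers.

```python
import numpy as np

# Placeholder training data and a hypothetical dual solution alpha
# (illustrative values only, not produced by an actual optimizer).
X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.3, 0.0, 0.3, 0.0])  # nonzero entries mark support vectors

# Primal weights from the dual: w = sum_i alpha_i * y_i * x_i.
w = (alpha * y) @ X

# Any support vector (alpha_i > 0) satisfies y_i (w.x_i + b) = 1, so
# b = y_i - w.x_i there; averaging over support vectors is more stable.
sv = alpha > 1e-8
b = np.mean(y[sv] - X[sv] @ w)
print(w, b)
```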
This equation tells us that the hyperplane depends only on the similarity between pairs of points, $x_i^T x_j$. This motivates the kernel trick: replacing the dot product with a kernel function $K(x_i, x_j)$ implicitly maps the data into a higher-dimensional feature space, giving a nonlinear decision boundary that can separate linearly inseparable data.
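As one illustration of the kernel trick (assuming NumPy; the RBF kernel and its bandwidth are just one common choice), this sketch builds the Gram matrix of pairwise similarities that stands in for the dot products $x_i^T x_j$ in the dual.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=0.5):
    """Gaussian (RBF) kernel K(a, b) = exp(-gamma * ||a - b||^2) for all pairs."""
    sq_dists = (
        np.sum(X1 ** 2, axis=1)[:, None]
        + np.sum(X2 ** 2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    return np.exp(-gamma * sq_dists)

# The Gram matrix K replaces the matrix of dot products x_i . x_j in the dual.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
K = rbf_kernel(X, X)
print(K.shape)  # (3, 3)
```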
The hard-margin formulation above has no solution (the algorithms will not converge) if the data cannot be perfectly separated. In this case, the non-separable (soft-margin) SVM introduces a slack variable $\xi_i \geq 0$ for each point, which we can interpret as the amount that point $i$ violates the margin. Adding it to the primal problem, we get a softer constraint:

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0.$$
This form can be interpreted as minimizing the hinge loss, defined as

$$\ell(x_i, y_i) = \max\!\left(0,\; 1 - y_i(w^T x_i + b)\right),$$

since at the optimum each slack variable $\xi_i$ equals the hinge loss of point $i$.
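A minimal sketch of the hinge loss (assuming NumPy; the function name and arguments are illustrative):

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """Average hinge loss max(0, 1 - y_i (w.x_i + b)) over the dataset."""
    margins = y * (X @ w + b)
    return np.mean(np.maximum(0.0, 1.0 - margins))
```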
This form of the primal can be generalized by penalizing the slack variables with different Norms, e.g. $\sum_i \xi_i^p$ instead of $\sum_i \xi_i$.
In the non-separable dual, we have an additional upper bound on the multipliers: $0 \leq \alpha_i \leq C$.
Model
The model itself consists of the weights $w$ and bias $b$ (or, in the kernelized dual form, the support vectors with their coefficients $\alpha_i$ and $b$).
Training
The training method was not covered in class, but algorithms like sequential minimal optimization (SMO) and the sub-gradient method work well; a sketch of the latter is shown below.
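The source does not specify a training algorithm, so as one illustration, here is a minimal sub-gradient descent sketch for the soft-margin objective (hinge loss plus L2 regularization); the function name, hyperparameters, and toy data are all placeholders, not a definitive implementation.

```python
import numpy as np

def train_svm_subgradient(X, y, C=1.0, lr=0.01, epochs=200):
    """Sub-gradient descent on (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        violated = margins < 1  # points contributing a nonzero hinge sub-gradient
        grad_w = w - C * (y[violated] @ X[violated])
        grad_b = -C * np.sum(y[violated])
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy usage with linearly separable placeholder data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_svm_subgradient(X, y)
print(np.sign(X @ w + b))  # should match y after enough epochs
```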
Prediction
Given an input $x$, the model predicts the class $\hat{y} = \operatorname{sign}(w^T x + b)$, or $\operatorname{sign}\!\left(\sum_i \alpha_i y_i K(x_i, x) + b\right)$ in the kernelized form.
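As a sketch of prediction in both forms (assuming NumPy; `kernel` can be any kernel function such as the `rbf_kernel` sketched earlier, and all names are illustrative):

```python
import numpy as np

def predict_linear(w, b, X):
    """Linear SVM prediction: sign(w.x + b) for each row of X."""
    return np.sign(X @ w + b)

def predict_kernel(alpha, b, y_train, X_train, X, kernel):
    """Kernelized prediction: sign(sum_i alpha_i y_i K(x_i, x) + b)."""
    K = kernel(X_train, X)  # shape (n_train, n_test)
    return np.sign((alpha * y_train) @ K + b)
```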