Theory

Perceptrons are an online learning alternative to the 🛩️ Support Vector Machine. We optimize a classifying hyperplane $w^\top x = 0$, updating it if we get a prediction wrong and keeping it the same if we get a prediction right.

We update the plane by modifying its perpendicular vector $w$ to look more like the example we get wrong. In other words, our update step is

$$w_{t+1} = w_t + \eta\, y_i x_i$$

with a certain learning rate $\eta$.

Intuitively, we can see the new classification $w_{t+1}^\top x_i = w_t^\top x_i + y_i \|x_i\|^2$ (assuming $\eta = 1$ for simplicity) as getting more positive or negative, depending on the class.
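
As a quick sanity check, here is a minimal NumPy sketch (the numbers and variable names are made up for illustration) showing that a single update with $\eta = 1$ shifts the score $w^\top x_i$ by exactly $y_i \|x_i\|^2$:

```python
import numpy as np

# A point we currently misclassify: true label is +1 but the score is negative.
w = np.array([0.5, -1.0])
x_i = np.array([1.0, 1.0])
y_i = +1

old_score = w @ x_i                # -0.5, wrong sign
w_new = w + y_i * x_i              # perceptron update with eta = 1
new_score = w_new @ x_i            # old_score + y_i * ||x_i||^2 = 1.5

print(old_score, new_score)        # the score moves toward the correct class
```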

For linearly separable data $\{(x_i, y_i)\}$ with $y_i \in \{-1, +1\}$, the algorithm will converge after at most $\frac{R^2}{\gamma^2}$ mistakes, where $R = \max_i \|x_i\|$ (size of the biggest $x_i$) and margin $\gamma = \min_i y_i\,(w^{*\top} x_i)$, $w^*$ being the optimal classifier.


Proof of Convergence

We'll prove the convergence property for $\eta = 1$, assuming $y_i\,(w^{*\top} x_i) \ge \gamma$ for all $i$. Then, the perceptron will converge after at most $R^2/\gamma^2$ mistakes.

First, note that we can assume that $\|w^*\| = 1$ since the optimal hyperplane is scale-invariant. Then, let $\theta$ be the angle between $w_t$ and $w^*$, and consider $\cos\theta = \frac{w^{*\top} w_t}{\|w_t\|}$.

Let's say we got $(x_i, y_i)$ wrong at time $t$, so $y_i\,(w_t^\top x_i) \le 0$. Then, we'll update $w_{t+1} = w_t + y_i x_i$, and consider the quantity $w^{*\top} w_{t+1}$ for the numerator.

$$w^{*\top} w_{t+1} = w^{*\top} w_t + y_i\,(w^{*\top} x_i) \ge w^{*\top} w_t + \gamma$$

Next, let's see what happens to the square of the denominator.

$$\|w_{t+1}\|^2 = \|w_t\|^2 + 2 y_i\,(w_t^\top x_i) + \|x_i\|^2 \le \|w_t\|^2 + R^2,$$

since $y_i\,(w_t^\top x_i) \le 0$ and $\|x_i\| \le R$.

Over $M$ time-steps with a mistake (taking $w_0 = 0$), we have $w^{*\top} w_M \ge M\gamma$ and $\|w_M\|^2 \le M R^2$. Putting them together,

$$\cos\theta = \frac{w^{*\top} w_M}{\|w_M\|} \ge \frac{M\gamma}{\sqrt{M}\, R} = \frac{\sqrt{M}\,\gamma}{R}.$$

However, the maximum value for cosine is $1$, so

$$\frac{\sqrt{M}\,\gamma}{R} \le 1 \implies M \le \frac{R^2}{\gamma^2}.$$


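As a quick empirical check of this bound, here is a rough NumPy sketch (names are my own; I use the known generating hyperplane as the reference separator $w^*$, which is a valid if not optimal choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linearly separable data: labels come from a known unit-norm separator w_star.
d, n = 5, 200
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)

R = np.max(np.linalg.norm(X, axis=1))    # size of the biggest x_i
gamma = np.min(y * (X @ w_star))         # margin with respect to w_star

# Run the perceptron (eta = 1, w_0 = 0) and count mistakes until convergence.
w = np.zeros(d)
mistakes = 0
converged = False
while not converged:
    converged = True
    for x_i, y_i in zip(X, y):
        if y_i * (w @ x_i) <= 0:         # mistake: update toward the example
            w = w + y_i * x_i
            mistakes += 1
            converged = False

print(mistakes, (R / gamma) ** 2)        # mistakes should not exceed R^2 / gamma^2
```
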
If the data is not linearly separable, the simple perceptron will keep updating $w$ forever, so we need to manually stop training. After stopping, there are variations in the way we get our final model.

  1. Voted perceptron: keep track of all intermediate models and predict the majority vote of running the input through all of them
  2. Averaged perceptron: the final model is the average of all intermediate models, essentially an approximation of the voted perceptron with much faster prediction times (see the sketch below)
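
Here is a minimal averaged-perceptron sketch (a rough NumPy outline; the function and variable names are my own, not from any library). Summing the weights after every example is the same as averaging every intermediate model:

```python
import numpy as np

def averaged_perceptron(X, y, epochs=10):
    """Averaged perceptron: return the average of all intermediate weight vectors."""
    n, d = X.shape
    w = np.zeros(d)          # current weights
    w_sum = np.zeros(d)      # running sum of intermediate weights
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:   # mistake: standard perceptron update
                w = w + y_i * x_i
            w_sum += w                 # accumulate the model after every example
    return w_sum / (epochs * n)        # averaged model used at prediction time
```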

We can also apply the 🍿 Kernel trick to the perceptron. Note that due to our update rule, our weights will always be a linear combination of the inputs, $w = \sum_i \alpha_i y_i x_i$. Then, we can replace all instances of $w^\top x$ with $\sum_i \alpha_i y_i\,(x_i^\top x)$ so that everything is in terms of inner products between inputs. Then, we replace $x_i^\top x_j$ with $k(x_i, x_j)$ to achieve the kernelized perceptron.
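
Below is a minimal kernelized-perceptron sketch (assumed names, written in NumPy; $\alpha_i$ counts how many times example $i$ triggered an update, and the RBF kernel is just one possible choice of $k$):

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """One possible kernel: k(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def train_kernel_perceptron(X, y, kernel=rbf_kernel, epochs=10):
    """Dual perceptron: w is never formed; alpha[i] counts updates on example i."""
    n = len(X)
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            # score = sum_j alpha_j * y_j * k(x_j, x_i), replacing w^T x_i
            score = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(n))
            if y[i] * score <= 0:
                alpha[i] += 1          # mistake: "add y_i x_i to w" in dual form
    return alpha

def predict_kernel_perceptron(X_train, y_train, alpha, x, kernel=rbf_kernel):
    score = sum(alpha[j] * y_train[j] * kernel(X_train[j], x)
                for j in range(len(X_train)))
    return np.sign(score)
```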

A nuanced variation is the Margin-Infused Relaxed Algorithm (MIRA), a type of passive-aggressive perceptron that minimizes the hinge loss at each observation. It updates $w$ to be as close as possible to the old weights while setting the hinge loss to $0$ (in case of misclassification or a correct prediction within the margin); in other words, it makes the smallest change to $w$ needed for a correct prediction outside the margin.
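
Concretely (a sketch of the standard passive-aggressive formulation; the notation here is mine), the update solves

$$w_{t+1} = \arg\min_{w} \tfrac{1}{2}\,\|w - w_t\|^2 \quad \text{s.t.} \quad \max\big(0,\ 1 - y_i\,(w^\top x_i)\big) = 0,$$

whose closed-form solution is $w_{t+1} = w_t + \eta\, y_i x_i$ with the per-example learning rate $\eta$ given under Training below.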

Model

Same as the SVM, we maintain a hyperplane $w^\top x = 0$ that splits the space, classifying each side. We update the hyperplane with learning rate $\eta$, which we can configure differently for different loss functions; the passive-aggressive perceptron, for example, uses it to optimize the hinge loss.

Training

Given training data $D = \{(x_i, y_i)\}_{i=1}^n$, randomly initialize hyperplane $w$.

While $w$ incorrectly classifies datapoints in $D$, for each $(x_i, y_i)$,

  1. Predict $\hat{y}_i = \text{sign}(w^\top x_i)$.
  2. If $\hat{y}_i \neq y_i$, update $w \leftarrow w + \eta\, y_i x_i$.
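
Putting the loop together, a rough NumPy sketch of this procedure (names are my own; the number of passes is capped since non-separable data would otherwise never terminate):

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100, seed=0):
    """Simple perceptron training; entries of y are +1 or -1."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.normal(size=d)                 # randomly initialized hyperplane
    for _ in range(max_epochs):            # cap passes in case data isn't separable
        mistakes = 0
        for x_i, y_i in zip(X, y):
            y_hat = np.sign(w @ x_i)       # 1. predict
            if y_hat != y_i:               # 2. wrong prediction: update
                w = w + eta * y_i * x_i
                mistakes += 1
        if mistakes == 0:                  # every point classified correctly
            break
    return w
```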

Info

For a multi-class perceptron, in case of a wrong class prediction, we update both the weights for the correct and incorrect classes; the former adds $\eta\, x_i$ while the latter subtracts it.
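
A minimal sketch of that multi-class update (one weight vector per class; the names are mine):

```python
import numpy as np

def multiclass_perceptron_step(W, x_i, y_i, eta=1.0):
    """One update step. W has one row of weights per class; y_i is the true class index."""
    y_hat = int(np.argmax(W @ x_i))    # predict the highest-scoring class
    if y_hat != y_i:
        W[y_i] += eta * x_i            # correct class: add the example
        W[y_hat] -= eta * x_i          # predicted (wrong) class: subtract it
    return W
```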

For simple perceptrons, $\eta = 1$; with passive-aggressive perceptrons,

$$\eta = \frac{\ell\big(w; (x_i, y_i)\big)}{\|x_i\|^2},$$

where $\ell$ is the hinge loss $\ell\big(w; (x_i, y_i)\big) = \max\big(0,\ 1 - y_i\,(w^\top x_i)\big)$.
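
For concreteness, a small sketch of this passive-aggressive step (assumed names, not from any library):

```python
import numpy as np

def passive_aggressive_step(w, x_i, y_i):
    """One passive-aggressive (MIRA-style) update: smallest change that zeroes the hinge loss."""
    hinge = max(0.0, 1.0 - y_i * (w @ x_i))    # hinge loss at the current weights
    if hinge > 0.0:                            # misclassified or within the margin
        eta = hinge / (x_i @ x_i)              # per-example learning rate
        w = w + eta * y_i * x_i                # afterward, y_i * (w @ x_i) == 1
    return w
```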

Prediction

As with the SVM, given input $x$, our prediction is $\hat{y} = \text{sign}(w^\top x)$.
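
Or, in the same NumPy style as above:

```python
import numpy as np

def predict(w, x):
    """Classify x by which side of the hyperplane it falls on."""
    return np.sign(w @ x)
```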