Theory

AdaBoost is a boosting method that focuses new learners on past errors by weighting each datapoint $i$ with a weight $w_i$.

Each learner $h_t$ is a stump: a decision tree with only one split. This gives the simplest possible split of the data. The learner is also weighted by a coefficient $\alpha_t$, which can be interpreted as its relative importance in the ensemble.
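To make the stump concrete, here is a minimal sketch (mine, not from the text above) of a weak learner that picks the single feature, threshold, and polarity minimizing the weighted classification error; the names `fit_stump` and `stump_predict` are illustrative assumptions.

```python
import numpy as np

def fit_stump(X, y, w):
    """Fit a decision stump (a single split) that minimizes weighted error.

    X: (N, D) features, y: (N,) labels in {-1, +1}, w: (N,) weights summing to 1.
    Returns (feature index, threshold, polarity).
    """
    best, best_err = (0, 0.0, 1), np.inf
    for d in range(X.shape[1]):
        for thresh in np.unique(X[:, d]):
            for polarity in (1, -1):
                # Predict +polarity on one side of the threshold, -polarity on the other.
                pred = np.where(X[:, d] >= thresh, polarity, -polarity)
                err = np.sum(w[pred != y])  # weighted misclassification
                if err < best_err:
                    best_err, best = err, (d, thresh, polarity)
    return best

def stump_predict(stump, X):
    d, thresh, polarity = stump
    return np.where(X[:, d] >= thresh, polarity, -polarity)
```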

The entire model is defined as

$$H(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$$

At every iteration, after we train a new learner $h_t$, AdaBoost redistributes the weights for each datapoint depending on whether $h_t$ got the datapoint right or wrong. The new weight calculation is as follows.

$$w_i^{(t+1)} = \frac{w_i^{(t)} \, e^{-\alpha_t y_i h_t(x_i)}}{Z_t}$$

Note that $y_i h_t(x_i)$ is $+1$ when the prediction is correct and $-1$ when it's wrong. The weight therefore increases when the prediction is wrong and decreases when it's correct. $Z_t$ is a normalization term that keeps the sum of the weights equal to $1$.

Next, to scale the learner, $\alpha_t$ is defined as

$$\alpha_t = \frac{1}{2} \ln\!\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$$

where $\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} w_i^{(t)}$ is the weighted error of $h_t$.

While this equation for $\alpha_t$ initially seems arbitrary, it's actually selected to work out mathematically with the weight redistribution step above. Since $e^{-\alpha_t} = \sqrt{\epsilon_t / (1 - \epsilon_t)}$ and $e^{\alpha_t} = \sqrt{(1 - \epsilon_t) / \epsilon_t}$, plugging it into the equation for $w_i^{(t+1)}$ gives

$$w_i^{(t+1)} = \frac{w_i^{(t)}}{Z_t} \cdot \begin{cases} \sqrt{\dfrac{\epsilon_t}{1 - \epsilon_t}} & \text{if the prediction is correct} \\[1ex] \sqrt{\dfrac{1 - \epsilon_t}{\epsilon_t}} & \text{if it is incorrect} \end{cases}$$

This definition can then be used to define $Z_t$, which is simply the sum over all new weight values before normalizing (dividing by $Z_t$).

$$Z_t = \sqrt{\frac{\epsilon_t}{1 - \epsilon_t}} \sum_{i\ \text{correct}} w_i^{(t)} + \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}} \sum_{i\ \text{incorrect}} w_i^{(t)}$$

Observing that $\sum_{i\ \text{incorrect}} w_i^{(t)} = \epsilon_t$ and therefore $\sum_{i\ \text{correct}} w_i^{(t)} = 1 - \epsilon_t$, the equation above simplifies to

$$Z_t = (1 - \epsilon_t)\sqrt{\frac{\epsilon_t}{1 - \epsilon_t}} + \epsilon_t\sqrt{\frac{1 - \epsilon_t}{\epsilon_t}} = 2\sqrt{\epsilon_t (1 - \epsilon_t)}$$

The update step for $w_i^{(t+1)}$ can now be simplified to

$$w_i^{(t+1)} = \begin{cases} \dfrac{w_i^{(t)}}{2(1 - \epsilon_t)} & \text{if the prediction is correct} \\[1ex] \dfrac{w_i^{(t)}}{2\epsilon_t} & \text{if it is incorrect} \end{cases}$$

Now we can see that $\sum_{i\ \text{correct}} w_i^{(t+1)} = \frac{1 - \epsilon_t}{2(1 - \epsilon_t)} = \frac{1}{2}$ and $\sum_{i\ \text{incorrect}} w_i^{(t+1)} = \frac{\epsilon_t}{2\epsilon_t} = \frac{1}{2}$. In other words, the equation chosen for $\alpha_t$ re-weights the datapoints so that the correct points collectively split up half the distribution and the incorrect points split up the other half. Since a weak learner should get more points correct than incorrect ($\epsilon_t < \frac{1}{2}$), the incorrect points share their half among fewer examples, so their individual weights end up larger.
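As a quick numerical check of this result (a sketch with made-up numbers, not part of the derivation above), the update sends the correct and incorrect points to half of the total weight each:

```python
import numpy as np

# Toy setup: 10 points with uniform weights; suppose the stump gets 3 wrong.
w = np.full(10, 0.1)
correct = np.array([True] * 7 + [False] * 3)

eps = w[~correct].sum()                # weighted error = 0.3
alpha = 0.5 * np.log((1 - eps) / eps)  # learner weight

# Unnormalized update: e^{-alpha} for correct points, e^{+alpha} for incorrect ones.
w_new = w * np.exp(np.where(correct, -alpha, alpha))
w_new /= w_new.sum()                   # divide by Z_t

print(w_new[correct].sum(), w_new[~correct].sum())  # 0.5 and 0.5
```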

Lastly, note that there was no explicit loss function to optimize. However, the algorithm implicitly minimizes the exponential loss $\sum_i e^{-y_i \sum_t \alpha_t h_t(x_i)}$, and its training error decreases exponentially fast in the number of iterations.
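For reference, the standard bound behind the "exponentially fast" claim (not derived in these notes) reuses the normalizers $Z_t$ computed above: with uniform initial weights, the training error is at most their product, which shrinks exponentially whenever every weak learner beats random guessing by some margin $\gamma$.

$$\frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\!\left[H(x_i) \neq y_i\right] \;\le\; \prod_{t=1}^{T} Z_t \;=\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t(1 - \epsilon_t)} \;\le\; e^{-2\gamma^2 T} \quad \text{if every } \epsilon_t \le \tfrac{1}{2} - \gamma$$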

Note

Since it doesn't directly optimize the training error, the test error can keep decreasing even after the training error reaches zero; in a sense, AdaBoost stretches the margin of classification, making the model more confident in each prediction.

Model

The model consists of many weak learners (stumps) $h_1, \dots, h_T$ as well as their scaling coefficients $\alpha_1, \dots, \alpha_T$.

The entire ensemble is given by $H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$. Note that this is the same additive form as stagewise regression.

Hyperparameters

  • $T$ is the ensemble size, equal to the number of training iterations. Higher $T$ generally gives a better ensemble, though it takes longer to train.

Training

Given binary classification training data with $N$ examples, $\{(x_i, y_i)\}_{i=1}^{N}$ with $y_i \in \{-1, +1\}$, let the initial weights be $w_i^{(1)} = \frac{1}{N}$.

For $t = 1, \dots, T$ iterations,

  1. Train a weak classifier $h_t$ on the distribution defined by the weights $w^{(t)}$.
  2. Evaluate $h_t$ to calculate the weighted error $\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} w_i^{(t)}$.
  3. Let the scaling factor $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$.
  4. Update the distribution $w_i^{(t+1)} = \frac{w_i^{(t)} \, e^{-\alpha_t y_i h_t(x_i)}}{Z_t}$, where $Z_t$ normalizes the new weights to sum to $1$.

This gives greater weight to training examples that we got wrong.
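The whole training loop can be sketched in a few lines of Python. This is an illustrative implementation, not a reference one: it uses scikit-learn's `DecisionTreeClassifier` with `max_depth=1` as the stump, assumes labels in $\{-1, +1\}$, and the name `adaboost_train` is my own.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T):
    """Train an AdaBoost ensemble of T stumps; y must contain only -1 and +1."""
    y = np.asarray(y)
    w = np.full(len(y), 1.0 / len(y))        # initial distribution w_i = 1/N
    ensemble = []                             # list of (alpha_t, stump_t) pairs
    for t in range(T):
        # 1. Train a stump on the distribution defined by the current weights.
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        # 2. Weighted error of the stump (clipped to avoid division by zero).
        eps = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
        # 3. Scaling factor alpha_t.
        alpha = 0.5 * np.log((1 - eps) / eps)
        # 4. Re-weight the datapoints and normalize so the weights sum to 1.
        w = w * np.exp(-alpha * y * pred)
        w /= w.sum()
        ensemble.append((alpha, stump))
    return ensemble
```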

Prediction

Given an input $x$, calculate $h_t(x)$ for all $t$ and return the weighted majority decision $H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.
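Continuing the training sketch above (so `adaboost_train` and its `(alpha, stump)` pairs are assumed), prediction is just the sign of the weighted vote, shown here on a hypothetical toy dataset:

```python
import numpy as np

def adaboost_predict(ensemble, X):
    """Return H(x) = sign(sum_t alpha_t * h_t(x)) for each row of X."""
    scores = sum(alpha * stump.predict(X) for alpha, stump in ensemble)
    return np.sign(scores)

# Toy usage on synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

ensemble = adaboost_train(X, y, T=25)
print((adaboost_predict(ensemble, X) == y).mean())  # training accuracy
```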