Theory

Gradient tree boosting builds an ensemble of trees (instead of stumps as in ๐Ÿ”ฅ Adaboost) of a specified depth, each with a different scaling coefficient and trained on a fraction of the dataset.

The general idea is very similar to a ๐ŸŒฒ Random Forest, but boosting introduces two different ideas.

  1. Fit each new tree on the residual instead of the original data.
  2. Regress to find each new tree's scaling coefficient, then add the tree to the model scaled by the learning rate.

We can train the algorithm on any loss function; the residual we fit on is the negative gradient of this loss with respect to the model's current predictions. For simplicity, we'll use a regression loss, and we move down the gradient every time we add a tree, thereby decreasing the loss with each new addition to the model.
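
For example, with the squared-error loss $L(y, F(x)) = \tfrac{1}{2}\bigl(y - F(x)\bigr)^2$ (one common choice of regression loss; the notes above don't pin down a specific one), the negative gradient with respect to the current prediction is exactly the residual:

$$-\frac{\partial L}{\partial F(x)} = -\frac{\partial}{\partial F(x)}\,\frac{1}{2}\bigl(y - F(x)\bigr)^2 = y - F(x)$$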

Model

Our model consists of $M$ trees $t_1, \dots, t_M$ with scaling coefficients $\gamma_1, \dots, \gamma_M$. In other words, our model is

$$F(x) = f_0 + \eta \sum_{m=1}^{M} \gamma_m\, t_m(x),$$

where $f_0$ is a constant term (generally used for regression) that's equivalent to the average target value $\bar{y}$.

The trees are limited to depth $d$, and each is trained on a fraction $\eta$ of the data. $\eta$ is also used as the learning rate when we add new trees.
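
As a minimal sketch, the model can be stored as a constant plus a list of (tree, coefficient) pairs. The `BoostedModel` name and its fields below are assumptions of mine, not from the original notes; the class just holds the pieces and evaluates the sum above:

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class BoostedModel:
    """Hypothetical container for the ensemble F(x) = f0 + eta * sum_m gamma_m * t_m(x)."""
    f0: float                                   # constant term: mean of the training targets
    eta: float                                  # fraction of data per tree, reused as the learning rate
    trees: list = field(default_factory=list)   # fitted regression trees t_m
    gammas: list = field(default_factory=list)  # scaling coefficients gamma_m

    def predict(self, X):
        # Evaluate F(x) = f0 + eta * sum_m gamma_m * t_m(x) for each row of X.
        pred = np.full(len(X), self.f0)
        for tree, gamma in zip(self.trees, self.gammas):
            pred += self.eta * gamma * tree.predict(X)
        return pred
```

Training then only has to append trees and coefficients to this container.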

Training

Given training data $(x_1, y_1), \dots, (x_n, y_n)$, let $f_0 = \bar{y}$ be the average target value.

Then, for $M$ iterations ($m = 1, \dots, M$),

  1. Pick a subset of the data of size $\eta n$.
  2. Calculate the residual of this subset, $r_i = y_i - F(x_i)$.
  3. Fit a decision tree $t_m$ on the residual.
  4. Regress for the scaling coefficient $\gamma_m$ that best fits the tree's predictions to the residual.
  5. Update our model using the learning rate: $F(x) \leftarrow F(x) + \eta\,\gamma_m\, t_m(x)$ (see the sketch after this list).
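
A sketch of this loop in Python, reusing the hypothetical `BoostedModel` container from the Model section and scikit-learn's `DecisionTreeRegressor` as the base learner; the function name `fit_gbt` is mine, and the closed-form coefficient in step 4 assumes the squared-error loss:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gbt(X, y, n_trees=100, depth=3, eta=0.1, seed=0):
    """Sketch of the training loop for numpy arrays X of shape (n, p) and y of shape (n,)."""
    rng = np.random.default_rng(seed)
    model = BoostedModel(f0=float(np.mean(y)), eta=eta)  # start from the average target value
    n_sub = max(1, int(eta * len(X)))
    for _ in range(n_trees):
        # 1. pick a random subset of the data of size eta * n
        idx = rng.choice(len(X), size=n_sub, replace=False)
        # 2. residual on that subset: y - F(x), the negative gradient of the squared-error loss
        residual = y[idx] - model.predict(X[idx])
        # 3. fit a depth-limited decision tree on the residual
        tree = DecisionTreeRegressor(max_depth=depth).fit(X[idx], residual)
        # 4. regress for gamma_m: least-squares fit of the tree's output to the residual
        t_pred = tree.predict(X[idx])
        gamma = float(t_pred @ residual) / float(t_pred @ t_pred + 1e-12)
        # 5. update the model; the learning rate eta is applied inside BoostedModel.predict
        model.trees.append(tree)
        model.gammas.append(gamma)
    return model
```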

Prediction

Similar to Adaboost, given input $x$, calculate $t_m(x)$ for all $m$ and return

$$F(x) = f_0 + \eta \sum_{m=1}^{M} \gamma_m\, t_m(x).$$
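
A quick usage check of the hypothetical `fit_gbt` / `BoostedModel` sketches above on a toy regression problem:

```python
X = np.linspace(0.0, 6.0, 200).reshape(-1, 1)  # toy 1-D inputs
y = np.sin(X).ravel()                          # target: y = sin(x)

model = fit_gbt(X, y, n_trees=200, depth=2, eta=0.1)
print(model.predict(X[:5]))  # should roughly track sin(x) near x = 0
```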