Theory

Certain features in $x$ are indicative of what $y$ can be; given training data $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}$, we can divide up all $x$ into groups based on their features such that each group has a different $y$.

Decision trees use this idea to recursively divide up the data in $\mathcal{D}$, building a binary tree where each node is a "yes-no" question; depending on the answer, we go to a different subtree, and each leaf gives a prediction for the value of $y$. This structure is analogous to a 🌳 Binary Search Tree, but with questions at the nodes rather than value comparisons.
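To make the structure concrete, here is a minimal sketch of such a node in Python; the field names (`feature`, `threshold`, `yes`, `no`, `prediction`) are illustrative choices, not fixed terminology:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One node of a decision tree.

    An internal node stores a yes-no question ("is feature `feature`
    >= `threshold`?") and two subtrees; a leaf stores a prediction.
    """
    feature: Optional[int] = None      # index of the feature asked about
    threshold: Optional[float] = None  # threshold for the yes-no question
    yes: Optional["Node"] = None       # subtree taken when the answer is "yes"
    no: Optional["Node"] = None        # subtree taken when the answer is "no"
    prediction: Optional[int] = None   # label returned if this node is a leaf

    def is_leaf(self) -> bool:
        return self.prediction is not None
```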

To maximize the effectiveness of our questions, we want to split the data that reaches a node using a question about the feature that most determines $y$. To do this, we maximize 💰 Information Gain.
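As a sketch, information gain can be computed as the entropy of the labels reaching a node minus the weighted entropy of the two groups produced by the split; the helper names below are my own, and entropy is measured in bits:

```python
from collections import Counter
import math


def entropy(labels: list) -> float:
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels: list, left: list, right: list) -> float:
    """Entropy before the split minus the weighted entropy after it."""
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted
```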

Model

Our model is a binary tree: each internal node is a question, each edge is an answer, and each leaf is a prediction for $y$.

For discrete features, we can directly check the value in $x$. For continuous features, we must impose a threshold on the value in $x$ to get a binary answer; in other words, replace feature $x_j$ with the question $x_j \ge t$ for some threshold $t$.
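A minimal sketch of this convention, where candidate thresholds are taken as midpoints between consecutive distinct feature values (one common choice, assumed here rather than taken from the notes):

```python
def ask(x: list, feature: int, threshold: float) -> bool:
    """Binary question for a continuous feature: is x[feature] >= threshold?"""
    return x[feature] >= threshold


def candidate_thresholds(values: list) -> list:
    """Midpoints between consecutive distinct sorted values of a feature."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]
```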

Note

Note that with this threshold method, decision trees are scale invariant: if the scale of a feature increases, the best threshold increases by the same factor, so the decision tree learned after rescaling a feature is the same.

Training

Given training data $\mathcal{D}$,

  1. If all current records have the same label $y$, return a leaf with label $y$.
  2. If all current records have the same input $x$, return a leaf with the majority label.
  3. Otherwise, split on the feature with the most information gain and recurse on both sides (a sketch of this recursion follows the list).
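A minimal sketch of the recursion, assuming the `Node`, `information_gain`, and `candidate_thresholds` helpers from the earlier sketches and records given as numeric feature vectors:

```python
from collections import Counter


def build_tree(X: list, y: list) -> "Node":
    # 1. All labels identical: return a leaf with that label.
    if len(set(y)) == 1:
        return Node(prediction=y[0])

    # 2. All inputs identical (no question can separate them): majority label.
    if all(x == X[0] for x in X):
        return Node(prediction=Counter(y).most_common(1)[0][0])

    # 3. Otherwise, pick the (feature, threshold) with maximum information gain.
    best = None
    for feature in range(len(X[0])):
        for threshold in candidate_thresholds([x[feature] for x in X]):
            yes_idx = [i for i, x in enumerate(X) if x[feature] >= threshold]
            no_idx = [i for i, x in enumerate(X) if x[feature] < threshold]
            gain = information_gain(y, [y[i] for i in yes_idx], [y[i] for i in no_idx])
            if best is None or gain > best[0]:
                best = (gain, feature, threshold, yes_idx, no_idx)

    _, feature, threshold, yes_idx, no_idx = best
    return Node(
        feature=feature,
        threshold=threshold,
        yes=build_tree([X[i] for i in yes_idx], [y[i] for i in yes_idx]),
        no=build_tree([X[i] for i in no_idx], [y[i] for i in no_idx]),
    )
```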

Prediction

Given input $x$, walk down the decision tree according to the questions at each node and return the $y$ stored at the leaf we reach.
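A minimal sketch, assuming the `Node` structure from earlier:

```python
def predict(tree: "Node", x: list):
    """Answer the question at each node until a leaf is reached,
    then return that leaf's label."""
    node = tree
    while not node.is_leaf():
        node = node.yes if x[node.feature] >= node.threshold else node.no
    return node.prediction
```

For example, `predict(build_tree(X, y), x_new)` classifies a new input with the tree built by the training sketch above.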