Cross entropy generalizes Entropy to two distributions:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$

Putting $q = p$ recovers the ordinary entropy, $H(p, p) = H(p)$.
Note that if $p \neq q$, the cross entropy is strictly larger than the entropy; the gap is the KL divergence, $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$.
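As a quick sanity check on these identities, here is a minimal NumPy sketch (the three-outcome distributions $p$ and $q$ are made-up values) that computes the entropy, cross entropy, and KL divergence directly from their definitions:

```python
import numpy as np

# Two made-up discrete distributions over three outcomes (illustrative only).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

def entropy(p):
    """H(p) = -sum_x p(x) log p(x)."""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x)."""
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    return np.sum(p * np.log(p / q))

print(cross_entropy(p, p))                  # equals entropy(p)
print(cross_entropy(p, q))                  # larger than entropy(p) since p != q
print(entropy(p) + kl_divergence(p, q))     # matches cross_entropy(p, q)
```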
## Loss
In the context of machine learning, cross entropy is commonly used to compare a predicted distribution with the ground truth distribution. In a categorical setting, our model is expected to predict probabilities for each class, and our ground truth distribution is usually a one-hot encoded vector for the true class.
In this case, the cross entropy collapses to the negative log-probability assigned to the true class: if $y$ is the true class and $q$ is the predicted distribution, the one-hot $p$ zeroes out every other term and $H(p, q) = -\log q(y)$.
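For instance, in this small NumPy sketch (the 4-class predicted distribution and the true class index are made-up values), the full sum collapses to a single term:

```python
import numpy as np

q = np.array([0.1, 0.2, 0.6, 0.1])   # predicted distribution over 4 classes
p = np.array([0.0, 0.0, 1.0, 0.0])   # one-hot ground truth for class 2

loss = -np.sum(p * np.log(q))        # full cross entropy -sum_k p_k log q_k
print(loss, -np.log(q[2]))           # both equal -log q(true class)
```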
- In the binary case with $N$ datapoints, if $y_i \in \{0, 1\}$ is our true class and $q_i$ is our predicted probability for the first class, we have
$$L = -\frac{1}{N} \sum_{i=1}^{N} \Big[\, y_i \log q_i + (1 - y_i) \log (1 - q_i) \,\Big]$$
- For $K$ classes with $N$ datapoints, if $y_{i,k} = 1$ for our true class (and $y_{i,k} = 0$ for everything else) and $q_{i,k}$ is our predicted probability for class $k$,
$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log q_{i,k}$$
Note that this generalizes the binary case above, which simply expanded the inner summation into its two terms, $y_i \log q_i + (1 - y_i) \log (1 - q_i)$; the sketch below checks that the two forms agree numerically.
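This correspondence is easy to verify. The following NumPy sketch (the labels and predicted probabilities are made-up) evaluates the binary formula and the general $K$-class formula with $K = 2$ on the same data:

```python
import numpy as np

# Made-up example: N = 4 datapoints, K = 2 classes.
y = np.array([1, 0, 1, 1])             # indicator of the "first" class
q = np.array([0.9, 0.2, 0.7, 0.6])     # predicted probability of the first class

# Binary cross entropy, with the inner sum expanded into its two terms.
bce = -np.mean(y * np.log(q) + (1 - y) * np.log(1 - q))

# The same loss in the general K-class form with one-hot targets.
Y = np.stack([y, 1 - y], axis=1)       # one-hot targets, shape (N, K)
Q = np.stack([q, 1 - q], axis=1)       # full predicted distributions, shape (N, K)
cce = -np.mean(np.sum(Y * np.log(Q), axis=1))

print(bce, cce)                        # identical: binary is the K = 2 special case
```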
Lastly, if we compute probabilities with the softmax (like in virtually every classification method), the cross entropy loss can be further expressed in terms of the "scores" pre-softmax. If we let $z_{i,k}$ be the score for class $k$ on datapoint $i$, so that
$$q_{i,k} = \frac{e^{z_{i,k}}}{\sum_{j=1}^{K} e^{z_{i,j}}},$$
then substituting into the loss above gives
$$L = -\frac{1}{N} \sum_{i=1}^{N} \left( \sum_{k=1}^{K} y_{i,k}\, z_{i,k} - \log \sum_{j=1}^{K} e^{z_{i,j}} \right),$$
i.e. the score of the true class minus the log-sum-exp of all scores, averaged over datapoints and negated.
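Putting this together, here is a minimal NumPy sketch of the score-based form (the function name, array shapes, and example scores are all illustrative); shifting each row by its maximum keeps the log-sum-exp numerically stable without changing the loss:

```python
import numpy as np

def softmax_cross_entropy(scores, labels):
    # scores: (N, K) pre-softmax scores; labels: (N,) integer class indices.
    # Shift by the per-row max so exp() never overflows; the loss is unchanged.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_sum_exp = np.log(np.exp(shifted).sum(axis=1))
    true_class_scores = shifted[np.arange(scores.shape[0]), labels]
    # L = -(1/N) * sum_i ( z_{i, true class} - log sum_j exp(z_{i, j}) )
    return -np.mean(true_class_scores - log_sum_exp)

# Made-up scores for N = 3 datapoints and K = 3 classes.
scores = np.array([[ 2.0, 0.5, -1.0],
                   [ 0.1, 0.2,  0.3],
                   [-0.5, 1.5,  0.0]])
labels = np.array([0, 2, 1])

# Cross-check against the explicit softmax + negative log-probability route.
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
print(softmax_cross_entropy(scores, labels))
print(-np.mean(np.log(probs[np.arange(3), labels])))   # same value
```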