Cross entropy generalizes 🔥 Entropy to two distributions: $p$ is the true distribution, and $q$ is the predicted distribution. Using them, cross entropy measures the entropy we get if we encode events using our predicted distribution $q$ but have events happen according to the true distribution $p$. In other words, our surprise is defined by our predicted probabilities, while the expected value uses the actual, true probabilities of the event happening.

Putting $p$ and $q$ in their respective spots in the original equation, we get

$$H(p, q) = \mathbb{E}_{x \sim p}\left[\log \frac{1}{q(x)}\right] = -\sum_x p(x) \log q(x)$$

Note that if $p$ and $q$ are equal, then cross entropy is equal to entropy. Otherwise, the cross entropy $H(p, q)$ is higher than the true entropy $H(p)$: if we use an imperfect predicted distribution, our bit encoding won't be as optimal as if we knew the true distribution.
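
To make this concrete, here is a minimal sketch (assuming NumPy and base-2 logs, so everything is measured in bits; the function and variable names are just illustrative) that checks both claims: $H(p, p) = H(p)$, and $H(p, q) \geq H(p)$ for a mismatched $q$.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) log2 p(x), in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # treat 0 * log 0 as 0
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x), in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

p = np.array([0.5, 0.25, 0.25])        # true distribution
q = np.array([1/3, 1/3, 1/3])          # imperfect predicted distribution

print(entropy(p))                      # 1.5 bits
print(cross_entropy(p, p))             # 1.5 bits: equals H(p) when q == p
print(cross_entropy(p, q))             # ~1.585 bits: higher than H(p)
```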

Loss

In the context of machine learning, cross entropy is commonly used to compare a predicted distribution with the ground truth distribution. In a categorical setting, our model is expected to predict probabilities for each class, and our ground truth distribution is usually a one-hot encoded vector for the true class.

In this case, $p$ is the ground truth, and $q$ is the model prediction. The cross entropy loss is simply $H(p, q)$. However, since $p$ is one-hot, we can simplify our equations.

  1. In the binary case with $N$ datapoints, if $y_i \in \{0, 1\}$ is our true class and $\hat{y}_i$ is our predicted probability that $y_i = 1$, we have $L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$.
  1. For $C$ classes with $N$ datapoints, if $y_{i,c} = 1$ for our true class (and $y_{i,c} = 0$ for everything else) and $\hat{y}_{i,c}$ is our predicted probability for class $c$, we have $L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}$.

Note that this generalizes the binary case above, which expanded the inner summation over the $C = 2$ classes.
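
Here is a minimal sketch of both loss formulas (assuming NumPy and natural logs; the names `binary_cross_entropy`, `categorical_cross_entropy`, `y`, and `y_hat` are illustrative), along with a check that the categorical form with $C = 2$ reduces to the binary form.

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    """y: (N,) true labels in {0, 1}; y_hat: (N,) predicted probability of class 1."""
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def categorical_cross_entropy(y_onehot, y_hat):
    """y_onehot: (N, C) one-hot labels; y_hat: (N, C) predicted probabilities."""
    return -np.mean(np.sum(y_onehot * np.log(y_hat), axis=1))

# Three datapoints, two classes
y = np.array([1, 0, 1])
y_hat = np.array([0.9, 0.2, 0.6])               # P(class 1) for each datapoint

y_onehot = np.stack([1 - y, y], axis=1)         # (3, 2) one-hot labels
y_hat_2col = np.stack([1 - y_hat, y_hat], axis=1)

print(binary_cross_entropy(y, y_hat))                   # ~0.2798
print(categorical_cross_entropy(y_onehot, y_hat_2col))  # same value
```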

Lastly, if we compute probabilities with the softmax (like in virtually every classification method), the cross entropy loss can be further expressed in terms of the “scores” pre-softmax. If we let $z_{i,c}$ denote the score for class $c$ of datapoint $i$ and $y_i$ denote the true class of datapoint $i$, we have

$$L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{z_{i, y_i}}}{\sum_{c=1}^{C} e^{z_{i, c}}} = \frac{1}{N} \sum_{i=1}^{N} \left( -z_{i, y_i} + \log \sum_{c=1}^{C} e^{z_{i, c}} \right)$$
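
A minimal sketch of this scores-based form (assuming NumPy; `scores` holds the $(N, C)$ pre-softmax scores and `y` the integer true classes, both illustrative names). The subtraction of the row-wise maximum score is an extra numerical-stability step, not part of the formula above: it leaves the loss unchanged but avoids overflow when exponentiating.

```python
import numpy as np

def cross_entropy_from_scores(scores, y):
    """scores: (N, C) pre-softmax scores; y: (N,) integer true classes."""
    shifted = scores - scores.max(axis=1, keepdims=True)    # stability shift
    log_sum_exp = np.log(np.sum(np.exp(shifted), axis=1))
    # L = (1/N) sum_i [ -z_{i, y_i} + log sum_c exp(z_{i, c}) ]
    return np.mean(log_sum_exp - shifted[np.arange(len(y)), y])

scores = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2,  3.0]])
y = np.array([0, 2])

# Same result as applying the softmax first and plugging into the formulas above
softmax = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
print(cross_entropy_from_scores(scores, y))             # ~0.175
print(-np.mean(np.log(softmax[np.arange(len(y)), y])))  # same value
```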