The difference between 💧 Cross Entropy and 🔥 Entropy is known as KL divergence, which can be written in multiple common forms:

$$D_{KL}(P \,\|\, Q) = H(P, Q) - H(P) = \sum_x P(x) \log_2 \frac{P(x)}{Q(x)} = \mathbb{E}_{x \sim P}\left[\log_2 \frac{P(x)}{Q(x)}\right]$$

This value can be interpreted as the expected number of extra bits needed to transmit samples from the true distribution $P$ when we encode them using our predicted distribution $Q$ instead. It's equal to $0$ if and only if $P = Q$.
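As a quick sanity check, here's a minimal NumPy sketch (the distributions `p` and `q` and the helper names are made up for illustration) that computes the KL divergence both directly and as the gap between cross entropy and entropy:

```python
import numpy as np

def entropy(p):
    """H(P) = -sum_x P(x) log2 P(x), in bits."""
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log2 Q(x), in bits."""
    return -np.sum(p * np.log2(q))

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) log2(P(x) / Q(x)), in bits."""
    return np.sum(p * np.log2(p / q))

p = np.array([0.5, 0.25, 0.25])   # true distribution P (made up)
q = np.array([0.4, 0.4, 0.2])     # predicted distribution Q (made up)

print(kl_divergence(p, q))               # ≈ 0.072 extra bits
print(cross_entropy(p, q) - entropy(p))  # same value, via H(P, Q) - H(P)
print(kl_divergence(p, p))               # 0.0 when the prediction matches P
```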

Info

Note that this value is non-symmetric and non-negative, and it does not satisfy the triangle inequality, so KL divergence is not a true distance metric.
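A quick numeric check of the asymmetry, reusing the same made-up distributions as above (repeated here so the snippet is self-contained):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.4, 0.4, 0.2])

print(np.sum(p * np.log2(p / q)))   # D_KL(P || Q) ≈ 0.072
print(np.sum(q * np.log2(q / p)))   # D_KL(Q || P) ≈ 0.078, a different value
```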

We commonly see KL divergence or cross entropy used as a loss function in classification problems (for example, in 🦠 Logistic Regression). Here the true label is a one-hot encoding $p$, and our prediction $q$ consists of softmax probabilities. In this case, since $H(p) = 0$ for a one-hot $p$, the KL divergence simplifies to

$$D_{KL}(p \,\|\, q) = H(p, q) = -\log_2 q_c$$

for a single datapoint, where $c$ is the true label. Moreover, since the entropy of the true labels is constant regardless of our model's predictions, we can just use cross entropy as our loss function instead.
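A minimal sketch of this loss for a hypothetical 3-class problem with made-up logits (base-2 logs are used to stay consistent with the bit interpretation above; in practice natural logs are more common and only rescale the loss):

```python
import numpy as np

def softmax(z):
    """Turn raw scores into probabilities."""
    e = np.exp(z - np.max(z))   # shift by the max for numerical stability
    return e / e.sum()

def cross_entropy_loss(logits, true_class):
    """With a one-hot p, H(p, q) collapses to -log2 q_c for the true class c."""
    q = softmax(logits)
    return -np.log2(q[true_class])

logits = np.array([2.0, 1.0, 3.0])    # made-up scores for 3 classes
print(cross_entropy_loss(logits, 2))  # -log2 q_2

# Same value as the full sum -sum_i p_i log2 q_i with a one-hot p.
p_one_hot = np.array([0.0, 0.0, 1.0])
print(-np.sum(p_one_hot * np.log2(softmax(logits))))
```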

We can also apply KL divergence to 💰 Information Gain. Rather than computing it as the difference in entropies before and after knowing $Y$, it can instead be interpreted as the divergence between the joint distribution and the product of the marginals,

$$IG(X, Y) = D_{KL}\big(P(X, Y) \,\|\, P(X)P(Y)\big)$$
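A small sketch of that equivalence, using a made-up joint distribution over two binary variables (the array values are arbitrary):

```python
import numpy as np

# Made-up joint distribution P(X, Y); rows index X, columns index Y.
joint = np.array([[0.30, 0.10],
                  [0.15, 0.45]])

p_x = joint.sum(axis=1)            # marginal P(X)
p_y = joint.sum(axis=0)            # marginal P(Y)
independence = np.outer(p_x, p_y)  # P(X)P(Y): the joint if X and Y shared no information

# 1. Information gain as a difference of entropies: H(X) - H(X | Y).
h_x = -np.sum(p_x * np.log2(p_x))
h_x_given_y = -np.sum(joint * np.log2(joint / p_y))   # -sum_{x,y} P(x,y) log2 P(x|y)
print(h_x - h_x_given_y)                              # ≈ 0.181 bits

# 2. Information gain as the KL divergence between the joint and the product of marginals.
print(np.sum(joint * np.log2(joint / independence)))  # same value
```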