Entropy regularization in reinforcement learning promotes exploration by encouraging a stochastic policy to have higher 🔥 Entropy, allowing it to stumble upon more states by chance. The ultimate goal is thus
$$\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \Big(r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big)\right]$$
where $\alpha$ is a temperature hyperparameter that controls the degree of exploration we desire.
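As a concrete illustration, here is a minimal sketch (in Python, assuming a discrete action space) of how the per-step entropy bonus modifies an ordinary discounted return. The function names and the default $\alpha = 0.2$ are illustrative, not part of any particular library.

```python
import numpy as np

def policy_entropy(action_probs):
    """Shannon entropy H(pi(.|s)) of a discrete action distribution."""
    p = np.asarray(action_probs)
    p = p[p > 0]  # avoid log(0)
    return -np.sum(p * np.log(p))

def entropy_regularized_return(rewards, action_prob_seq, gamma=0.99, alpha=0.2):
    """Discounted return with an entropy bonus added to every step's reward.

    rewards:          list of r(s_t, a_t) along one trajectory
    action_prob_seq:  list of pi(.|s_t), the policy's action distribution at each state
    alpha:            temperature controlling how strongly exploration is rewarded
    """
    total = 0.0
    for t, (r, probs) in enumerate(zip(rewards, action_prob_seq)):
        total += gamma ** t * (r + alpha * policy_entropy(probs))
    return total

# Example: a 3-step trajectory; a more uniform (higher-entropy) policy earns a larger bonus.
rewards = [1.0, 0.0, 2.0]
uniform_policy = [[0.25] * 4] * 3                # high entropy
greedy_policy = [[0.97, 0.01, 0.01, 0.01]] * 3   # low entropy
print(entropy_regularized_return(rewards, uniform_policy))
print(entropy_regularized_return(rewards, greedy_policy))
```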

Our value functions follow a similar form:
$$V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \Big(r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big) \,\middle|\, s_0 = s\right]$$

$$Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) + \alpha \sum_{t=1}^{\infty} \gamma^t \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \,\middle|\, s_0 = s, a_0 = a\right]$$
Note that for the action-value, we only consider action entropy after the initial action $a_0$. Thus, our values are related:
$$V^\pi(s) = \mathbb{E}_{a \sim \pi}\big[Q^\pi(s, a)\big] + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s)\big)$$
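This relation is easy to check numerically for a discrete action space. The sketch below (the helper name `soft_state_value` is illustrative) computes $V^\pi(s)$ from the per-action $Q^\pi(s, a)$ values and the policy's action distribution at $s$.

```python
import numpy as np

def soft_state_value(q_values, action_probs, alpha=0.2):
    """V(s) = E_{a~pi}[Q(s, a)] + alpha * H(pi(.|s)) for a discrete action space.

    q_values:     array of Q(s, a), one entry per action a
    action_probs: pi(.|s), the policy's distribution over actions at state s
    """
    q = np.asarray(q_values)
    p = np.asarray(action_probs)
    expected_q = np.sum(p * q)                        # E_{a~pi}[Q(s, a)]
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))    # H(pi(.|s))
    return expected_q + alpha * entropy

# Example: a state with 3 available actions.
print(soft_state_value(q_values=[1.0, 2.0, 0.5], action_probs=[0.2, 0.5, 0.3]))
```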
Moreover, we have a variant of the 🔔 Bellman Equation: