Decisions in reinforcement learning are a balance of exploration and exploitation. The former tries new actions that might yield better rewards, while the latter repeats past actions that are known to yield high reward.
Exploitation is simple: choose the action with the highest estimated value. Exploration, on the other hand, can be implemented in multiple ways; all share the key idea of entering new states, but their definitions of "new" vary.
Bandits
As a partial justification for exploration techniques, we often analyze their performance on Multi-Armed Bandits. If we choose action $a_t$ at each timestep and receive reward $r(a_t)$, our performance over $T$ steps is measured by the regret
$$\mathrm{Reg}(T) = T\,\mathbb{E}[r(a^\star)] - \sum_{t=1}^{T} r(a_t),$$
where $a^\star = \arg\max_a \mathbb{E}[r(a)]$ is the optimal action; in other words, how much worse we do than an agent that always picks $a^\star$.
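To make the regret concrete, here is a minimal Python sketch of a Bernoulli bandit that evaluates an arbitrary action-selection rule; the arm probabilities and the `run_bandit` helper are illustrative, not from the original text:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.7])    # hypothetical Bernoulli arm probabilities
best_mean = true_means.max()              # E[r(a*)]

def run_bandit(select_action, T=1000):
    """Run an action-selection rule for T steps and return Reg(T)."""
    total_reward = 0.0
    history = []                                    # (action, reward) pairs the rule may use
    for t in range(T):
        a = select_action(history)
        r = float(rng.random() < true_means[a])     # Bernoulli reward draw
        total_reward += r
        history.append((a, r))
    return T * best_mean - total_reward             # Reg(T) = T E[r(a*)] - sum_t r(a_t)

# Baseline: a uniformly random rule explores forever and accrues linear regret.
print(run_bandit(lambda history: rng.integers(len(true_means))))
```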
Methods
The most basic exploration method is Epsilon-Greedy, but there are a variety of other methods that provide theoretical guarantees on regret. Moreover, while epsilon-greedy explores entirely at random, these methods bring some inherent strategy to their selection of exploratory actions.
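For reference, a minimal sketch of epsilon-greedy selection, assuming we track an empirical mean reward per action (the function name and signature here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def epsilon_greedy(mean_rewards, epsilon=0.1):
    """With probability epsilon, explore a uniformly random action;
    otherwise exploit the action with the highest estimated value."""
    if rng.random() < epsilon:
        return int(rng.integers(len(mean_rewards)))   # explore
    return int(np.argmax(mean_rewards))               # exploit
```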
Optimistic Exploration
Optimistic Exploration adds a variance estimate onto the standard reward, essentially giving a bonus to less-explored actions. Generally, our decision is
$$a = \arg\max_a \hat{\mu}_a + C \hat{\sigma}_a,$$
where $\hat{\mu}_a$ is the empirical mean reward of action $a$ and $\hat{\sigma}_a$ measures our uncertainty about that estimate. A standard instance is the UCB1 rule,
$$a = \arg\max_a \hat{\mu}_a + \sqrt{\frac{2 \ln T}{N(a)}},$$
where $N(a)$ counts how many times action $a$ has been taken so far. This achieves regret $O(\log T)$, which is best possible.
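A sketch of the UCB1 rule above, assuming per-action empirical means and counts are tracked externally (untried actions are taken first so the bonus is well defined):

```python
import numpy as np

def ucb1_action(mean_rewards, counts, t):
    """Return argmax_a  mu_hat(a) + sqrt(2 ln t / N(a))."""
    counts = np.asarray(counts, dtype=float)
    if np.any(counts == 0):
        return int(np.argmin(counts))                 # try every action at least once
    bonus = np.sqrt(2.0 * np.log(t) / counts)         # optimism bonus shrinks with N(a)
    return int(np.argmax(np.asarray(mean_rewards, dtype=float) + bonus))
```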
Thompson Sampling
Thompson Sampling approximates our POMDP belief state $\hat{p}(\theta_1, \dots, \theta_n)$, where $\theta_i$ parameterizes the reward distribution of action $a_i$. At each timestep we then repeat the following loop (sketched in code after the list):
- Sample $\theta_1, \dots, \theta_n \sim \hat{p}(\theta_1, \dots, \theta_n)$.
- Pretend these parameters are correct and take the optimal action.
- Observe the outcome and update our model.
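Here is a minimal sketch of this loop for Bernoulli rewards with independent Beta posteriors per arm; the environment probabilities and priors below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
true_means = np.array([0.2, 0.5, 0.7])      # hypothetical Bernoulli arms
alpha = np.ones_like(true_means)            # Beta(1, 1) prior for each arm
beta = np.ones_like(true_means)

for t in range(1000):
    theta = rng.beta(alpha, beta)            # sample parameters from the belief
    a = int(np.argmax(theta))                # act as if the sample were correct
    r = float(rng.random() < true_means[a])  # observe the outcome...
    alpha[a] += r                            # ...and update the posterior
    beta[a] += 1.0 - r
```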
Information Gain
Information Gain Exploration chooses the action that gives us the most information, defined by Information Gain. Specifically, if we let $z$ be the quantity we want to learn about (e.g. the reward parameters) and $y$ be the observation we receive after taking action $a$, the information gain is
$$\mathrm{IG}(z, y \mid a) = \mathbb{E}_y\!\left[\mathcal{H}(\hat{p}(z)) - \mathcal{H}(\hat{p}(z) \mid y) \,\middle|\, a\right],$$
the expected reduction in the entropy of our belief over $z$ after observing $y$; we then prefer actions with a larger expected reduction.
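As one concrete instance of this quantity, here is a sketch of the expected information gain from pulling a single Bernoulli arm whose parameter has a Beta posterior, computed as the expected drop in the belief's entropy; the specific Beta parameters below are illustrative:

```python
from scipy.stats import beta as beta_dist

def expected_info_gain(a, b):
    """IG = H(p(z)) - E_y[H(p(z | y))] for a Beta(a, b) belief over a
    Bernoulli parameter z, where y is the next 0/1 observation."""
    p_success = a / (a + b)                              # predictive probability of y = 1
    h_prior = beta_dist.entropy(a, b)                    # entropy of the current belief
    h_post = (p_success * beta_dist.entropy(a + 1, b)    # belief after observing y = 1
              + (1 - p_success) * beta_dist.entropy(a, b + 1))  # after y = 0
    return h_prior - h_post

# A barely explored arm promises more information than a well-explored one.
print(expected_info_gain(1, 1), expected_info_gain(50, 50))
```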