Decisions in reinforcement learning balance exploration and exploitation: the former tries new actions that might yield better rewards, and the latter repeats past actions known to yield high reward.

Exploitation is simple: choose the action with the highest estimated value. Exploration, on the other hand, can be implemented in multiple ways; all share the key idea of entering new states, but their definitions of “new” vary.

Bandits

As a partial justification for exploration techniques, we often analyze their performance on 🎰 Multi-Armed Bandits. If we choose action $a_i$, let the reward be defined by

$$r(a_i) \sim p_{\theta_i}(r_i),$$

where $\theta_i$ controls the reward distribution. This defines a POMDP with state $s = (\theta_1, \ldots, \theta_n)$. Solving this POMDP would give us the optimal exploration strategy, but doing so is generally intractable since the belief state is huge; however, we can assess the quality of an exploration algorithm by comparing it with the optimal POMDP solver. Formally, we measure the regret at time step $T$ as

$$\operatorname{Reg}(T) = T \, \mathbb{E}[r(a^\star)] - \sum_{t=1}^{T} r(a_t),$$

where $\mathbb{E}[r(a^\star)]$ is the expected reward of the best action.
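
As a concrete illustration, here is a minimal sketch of measuring regret on a Bernoulli bandit, with a uniformly random policy standing in for the learner; the arm probabilities `theta` and horizon `T` are made-up values, not from any particular source.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.3, 0.5, 0.7])   # hypothetical Bernoulli reward probabilities, one per arm
T = 1000

rewards = []
for t in range(T):
    a = rng.integers(len(theta))                      # placeholder policy: pick an arm uniformly at random
    rewards.append(float(rng.random() < theta[a]))    # Bernoulli reward r(a_t)

# Reg(T) = T * E[r(a*)] - sum_t r(a_t): the gap between always pulling the
# best arm in expectation and what the policy actually collected.
regret = T * theta.max() - sum(rewards)
print(f"regret after {T} steps: {regret:.1f}")
```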

Methods

The most basic exploration method is 💰 Epsilon-Greedy, but there are a variety of other methods that provide theoretical guarantees on regret. Moreover, while epsilon-greedy explores entirely at random, these methods select exploratory actions with some inherent strategy.
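
For reference, a minimal epsilon-greedy sketch over a table of estimated action values; the function name and parameters are illustrative rather than from any particular library.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    """With probability epsilon explore (uniform random action);
    otherwise exploit (action with the highest estimated value)."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```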

Optimistic Exploration

🤩 Optimistic Exploration adds a variance-based bonus onto the standard reward estimate, essentially boosting less-explored actions. Generally, our decision is

$$a = \arg\max_a \Big[ \hat{\mu}_a + C \hat{\sigma}_a \Big],$$

where $\hat{\mu}_a$ is the average reward for action $a$ and $\hat{\sigma}_a$ is the variance estimate. One such example is

$$a = \arg\max_a \left[ \hat{\mu}_a + \sqrt{\frac{2 \ln T}{N(a)}} \right],$$

where $N(a)$ is the number of times we picked action $a$. A lower $N(a)$ thus yields a higher bonus, encouraging our policy to pick the less-explored action. It can be shown that this example achieves

$$\operatorname{Reg}(T) = O(\log T),$$

which is the best possible.
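
A minimal sketch of the bonus-based rule above, assuming we track empirical means and pull counts per arm; `mu_hat`, `counts`, and `t` stand for the average rewards, $N(a)$, and the current time step.

```python
import numpy as np

def ucb_action(mu_hat, counts, t):
    """Pick argmax_a  mu_hat[a] + sqrt(2 ln t / N(a))."""
    mu_hat = np.asarray(mu_hat, dtype=float)
    counts = np.asarray(counts, dtype=float)
    if np.any(counts == 0):
        return int(np.argmin(counts))              # pull every arm once before trusting the bonus
    bonus = np.sqrt(2.0 * np.log(t) / counts)      # shrinks as N(a) grows, so rarely tried arms get a boost
    return int(np.argmax(mu_hat + bonus))
```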

Thompson Sampling

โ“ Thompson Sampling approximates our POMDP belief state by repeating the following:

  1. Sample $\theta_1, \dots, \theta_n \sim \hat{p}(\theta_1, \dots, \theta_n)$.
  2. Pretend these parameters are correct and take the optimal action.
  3. Observe the outcome and update our model.
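
A minimal sketch of this loop for Bernoulli arms with a Beta posterior over each $\theta_i$ (the conjugate choice; the true `theta` values below are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.3, 0.5, 0.7])       # hypothetical true arm probabilities, unknown to the agent
alpha = np.ones(len(theta))             # Beta(1, 1) prior over each theta_i
beta_ = np.ones(len(theta))

for t in range(1000):
    sampled = rng.beta(alpha, beta_)     # 1. sample theta_1..theta_n from the current belief
    a = int(np.argmax(sampled))          # 2. act as if the sampled parameters were correct
    r = float(rng.random() < theta[a])   # 3. observe the outcome...
    alpha[a] += r                        #    ...and update the model (conjugate Beta update)
    beta_[a] += 1.0 - r
```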

Information Gain

💬 Information Gain Exploration chooses the action that gives us the most information, as defined by 💰 Information Gain. Specifically, if we let $z$ be our quantity of interest (for example $z = \theta_a$) estimated via $\hat{p}(z)$, $a$ be our action, and $y$ be the observation after taking $a$ (for example $y = r(a)$), then we have

$$\operatorname{IG}(z, y \mid a) = \mathbb{E}_y\big[\mathcal{H}(\hat{p}(z)) - \mathcal{H}(\hat{p}(z) \mid y) \,\big|\, a\big].$$
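
As a sketch, this expected entropy reduction has a simple form for a Bernoulli arm with a Beta posterior over $z = \theta_a$; the use of `scipy.stats.beta(...).entropy()` and the example parameters are illustrative assumptions, not from the source.

```python
from scipy.stats import beta

def info_gain(al, be):
    """IG(z, y | a) = H(p_hat(z)) - E_y[H(p_hat(z) | y)], where z = theta_a has a
    Beta(al, be) posterior and y is the next 0/1 reward from pulling arm a."""
    h_now = beta(al, be).entropy()                 # current (differential) entropy of p_hat(z)
    p1 = al / (al + be)                            # predictive probability that y = 1
    h_next = p1 * beta(al + 1, be).entropy() + (1 - p1) * beta(al, be + 1).entropy()
    return h_now - h_next

# The exploration rule picks the arm with the largest expected information gain;
# a fresh arm (Beta(1, 1)) is far more informative than a well-explored one.
print(info_gain(1, 1), info_gain(50, 50))
```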

Entropy Regularization

🎲 Entropy Regularization