Decisions in reinforcement learning are a balance of exploration and exploitation. The former tries new actions that might yield better rewards, while the latter repeats past actions that are known to yield high reward.
Exploitation is simple: choose the action with the highest estimated value. Exploration, on the other hand, can be implemented in multiple ways; all share the key idea of entering new states, but their definitions of "new" vary.
Bandits
As a partial justification for exploration techniques, we often analyze their performance on Multi-Armed Bandits. If we choose action $a_t$ at each timestep and receive reward $r(a_t)$, our performance over $T$ steps is measured by the regret
$$\mathrm{Reg}(T) = T\,\mathbb{E}[r(a^\star)] - \sum_{t=1}^{T} r(a_t),$$
where $a^\star = \arg\max_a \mathbb{E}[r(a)]$ is the optimal action; in other words, how much worse we do than an agent that always picks $a^\star$.
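To make the regret concrete, here is a minimal Python sketch of a Bernoulli bandit that evaluates an arbitrary action-selection rule; the arm probabilities and the `run_bandit` helper are illustrative, not from the original text:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.7])    # hypothetical Bernoulli arm probabilities
best_mean = true_means.max()              # E[r(a*)]

def run_bandit(select_action, T=1000):
    """Run an action-selection rule for T steps and return Reg(T)."""
    total_reward = 0.0
    history = []                                    # (action, reward) pairs the rule may use
    for t in range(T):
        a = select_action(history)
        r = float(rng.random() < true_means[a])     # Bernoulli reward draw
        total_reward += r
        history.append((a, r))
    return T * best_mean - total_reward             # Reg(T) = T E[r(a*)] - sum_t r(a_t)

# Baseline: a uniformly random rule explores forever and accrues linear regret.
print(run_bandit(lambda history: rng.integers(len(true_means))))
```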
Methods
The most basic exploration method is Epsilon-Greedy, but there are a variety of other methods that provide theoretical guarantees on regret. Moreover, while epsilon-greedy explores entirely at random, these methods bring some inherent strategy to their selection of exploratory actions.
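For reference, a minimal sketch of epsilon-greedy selection, assuming we track an empirical mean reward per action (the function name and signature here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def epsilon_greedy(mean_rewards, epsilon=0.1):
    """With probability epsilon, explore a uniformly random action;
    otherwise exploit the action with the highest estimated value."""
    if rng.random() < epsilon:
        return int(rng.integers(len(mean_rewards)))   # explore
    return int(np.argmax(mean_rewards))               # exploit
```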
Optimistic Exploration
Optimistic Exploration adds a variance estimate onto the standard reward, essentially giving a bonus to less-explored actions. Generally, our decision is
$$a = \arg\max_a \hat{\mu}_a + C \hat{\sigma}_a,$$
where $\hat{\mu}_a$ is the empirical mean reward of action $a$ and $\hat{\sigma}_a$ measures our uncertainty about that estimate. A standard instance is the UCB1 rule,
$$a = \arg\max_a \hat{\mu}_a + \sqrt{\frac{2 \ln T}{N(a)}},$$
where $N(a)$ counts how many times action $a$ has been taken so far. This achieves regret $O(\log T)$, which is best possible.
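A sketch of the UCB1 rule above, assuming per-action empirical means and counts are tracked externally (untried actions are taken first so the bonus is well defined):

```python
import numpy as np

def ucb1_action(mean_rewards, counts, t):
    """Return argmax_a  mu_hat(a) + sqrt(2 ln t / N(a))."""
    counts = np.asarray(counts, dtype=float)
    if np.any(counts == 0):
        return int(np.argmin(counts))                 # try every action at least once
    bonus = np.sqrt(2.0 * np.log(t) / counts)         # optimism bonus shrinks with N(a)
    return int(np.argmax(np.asarray(mean_rewards, dtype=float) + bonus))
```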
Thompson Sampling
Thompson Sampling approximates our POMDP belief state $\hat{p}(\theta_1, \dots, \theta_n)$, where $\theta_i$ parameterizes the reward distribution of action $a_i$. At each timestep we then repeat the following loop (sketched in code after the list):
- Sample $\theta_1, \dots, \theta_n \sim \hat{p}(\theta_1, \dots, \theta_n)$.
- Pretend these parameters are correct and take the optimal action.
- Observe the outcome and update our model.
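Here is a minimal sketch of this loop for Bernoulli rewards with independent Beta posteriors per arm; the environment probabilities and priors below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
true_means = np.array([0.2, 0.5, 0.7])      # hypothetical Bernoulli arms
alpha = np.ones_like(true_means)            # Beta(1, 1) prior for each arm
beta = np.ones_like(true_means)

for t in range(1000):
    theta = rng.beta(alpha, beta)            # sample parameters from the belief
    a = int(np.argmax(theta))                # act as if the sample were correct
    r = float(rng.random() < true_means[a])  # observe the outcome...
    alpha[a] += r                            # ...and update the posterior
    beta[a] += 1.0 - r
```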
Information Gain
Information Gain Exploration chooses the action that gives us the most information, defined by Information Gain. Specifically, if we let $z$ be the quantity we want to learn about (e.g. the reward parameters) and $y$ be the observation we receive after taking action $a$, the information gain is
$$\mathrm{IG}(z, y \mid a) = \mathbb{E}_y\!\left[\mathcal{H}(\hat{p}(z)) - \mathcal{H}(\hat{p}(z) \mid y) \,\middle|\, a\right],$$
the expected reduction in the entropy of our belief over $z$ after observing $y$; we then prefer actions with a larger expected reduction.
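As one concrete instance of this quantity, here is a sketch of the expected information gain from pulling a single Bernoulli arm whose parameter has a Beta posterior, computed as the expected drop in the belief's entropy; the specific Beta parameters below are illustrative:

```python
from scipy.stats import beta as beta_dist

def expected_info_gain(a, b):
    """IG = H(p(z)) - E_y[H(p(z | y))] for a Beta(a, b) belief over a
    Bernoulli parameter z, where y is the next 0/1 observation."""
    p_success = a / (a + b)                              # predictive probability of y = 1
    h_prior = beta_dist.entropy(a, b)                    # entropy of the current belief
    h_post = (p_success * beta_dist.entropy(a + 1, b)    # belief after observing y = 1
              + (1 - p_success) * beta_dist.entropy(a, b + 1))  # after y = 0
    return h_prior - h_post

# A barely explored arm promises more information than a well-explored one.
print(expected_info_gain(1, 1), expected_info_gain(50, 50))
```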