Optimistic exploration augments the standard reward function by adding an exploration bonus,

$$r^+(s, a) = r(s, a) + \mathcal{B}(N(s)),$$

where $N(s)$ is a "count" for state $s$. As $N(s)$ increases, we design $\mathcal{B}$ to decrease. Many designs for $\mathcal{B}$ exist, but some examples are below:

  1. UCB: $\mathcal{B}(N(s)) = \sqrt{\frac{2 \ln n}{N(s)}}$.
  2. MBIE-EB: $\mathcal{B}(N(s)) = \sqrt{\frac{1}{N(s)}}$.
  3. BEB: $\mathcal{B}(N(s)) = \frac{1}{N(s)}$.
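As a minimal sketch, the three bonus schedules above can be written directly, assuming a tabular setting where $N(s)$ is an exact visit count and $n$ is the total number of visits so far (the function names are mine, for illustration):

```python
import math

def ucb_bonus(count: int, total: int) -> float:
    """UCB: sqrt(2 ln n / N(s))."""
    return math.sqrt(2 * math.log(total) / count)

def mbie_eb_bonus(count: int) -> float:
    """MBIE-EB: 1 / sqrt(N(s))."""
    return 1.0 / math.sqrt(count)

def beb_bonus(count: int) -> float:
    """BEB: 1 / N(s)."""
    return 1.0 / count

# All three shrink as the state is visited more often.
print(beb_bonus(1), beb_bonus(4))  # 1.0 0.25
```

The schedules differ only in how quickly the bonus decays, which controls how long the agent keeps revisiting under-explored states.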

However, we don't literally count because in complex environments, we almost never visit the same state twice. Thus, we employ some similarity measure, so the pseudo-count $\hat{N}(s)$ measures how similar $s$ is to our past states. There are a variety of methods to derive this pseudo-count, as described below.

Density Modeling

We can fit a density model $p_\theta(s)$ on our past experience. The idea is that if a state $s$ is similar to what we saw before, $p_\theta(s)$ would be high.

In the simple case where we measure true counts, our density would be exactly

$$p_\theta(s) = \frac{N(s)}{n},$$

where $n$ is the total number of states visited; after visiting $s$ one more time, the density becomes $\frac{N(s) + 1}{n + 1}$. This is a system of equations we can use to solve for our pseudo-counts $\hat{N}(s)$ and $\hat{n}$, so if we have the density model $p_\theta(s)$ trained on past experience and the updated density $p_{\theta'}(s)$ trained on past experience plus $s$,

$$\hat{N}(s) = \hat{n}\, p_\theta(s), \qquad \hat{N}(s) + 1 = (\hat{n} + 1)\, p_{\theta'}(s),$$

which solves to

$$\hat{N}(s) = \frac{p_\theta(s)\,(1 - p_{\theta'}(s))}{p_{\theta'}(s) - p_\theta(s)}.$$
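A minimal sketch of recovering a pseudo-count from two density evaluations, solving the system $\hat{N}(s) = \hat{n} p_\theta(s)$ and $\hat{N}(s) + 1 = (\hat{n} + 1) p_{\theta'}(s)$. Any density model works; here the probabilities are just example numbers chosen so true counts are recoverable:

```python
def pseudo_count(p_before: float, p_after: float) -> float:
    """Closed-form solution: N(s) = p * (1 - p') / (p' - p)."""
    return p_before * (1.0 - p_after) / (p_after - p_before)

# Sanity check with exact counts: N(s) = 3 out of n = 10 visits.
p = 3 / 10        # density before revisiting s
p_new = 4 / 11    # density after one more visit to s
print(pseudo_count(p, p_new))  # recovers 3.0 (up to float error)
```

With a learned density model the recovered $\hat{N}(s)$ is no longer an integer, but it still shrinks the bonus for states the model assigns high probability.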

Hashing

Alternatively, we can measure actual counts, but we'll instead count a hashed version of the state, $N(\phi(s))$. If our encoding method $\phi$ maps similar states to the same hash, then we effectively achieve the original goal. Empirical implementations use an 🧬 Autoencoder trained to reconstruct the states, and the hash is the bottleneck.
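A hypothetical sketch of count-by-hash, assuming states are byte strings. A truncated cryptographic hash stands in for the learned bottleneck here; note that unlike a trained encoder, it does not map similar states together, so it only illustrates the counting mechanics:

```python
from collections import defaultdict
import hashlib

counts: defaultdict[int, int] = defaultdict(int)

def phi(state: bytes, bits: int = 16) -> int:
    """Map a state to a short code; a trained encoder would map
    similar states to the same code."""
    digest = hashlib.sha1(state).digest()
    return int.from_bytes(digest[:2], "big") % (1 << bits)

def visit(state: bytes) -> int:
    """Record a visit and return the hashed count N(phi(s))."""
    counts[phi(state)] += 1
    return counts[phi(state)]

visit(b"state-a")
visit(b"state-a")
print(visit(b"state-a"))  # 3: identical states share a bucket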

Exemplar Modeling

Exemplar modeling implicitly captures density by comparing $s$ with all past states. Intuitively, the state is novel if it's easy to distinguish from past states. Thus, if we have a classifier $D_s$ trained with $s$ as the positive class and everything else in the negative class, the density is

$$p_\theta(s) = \frac{1 - D_s(s)}{D_s(s)}.$$

Note that if we visited $s$ before, it would also appear in the negative class. Moreover, if we include regularization, $D_s(s)$ serves as a good measure for the uniqueness of $s$.
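A minimal sketch of turning the exemplar classifier's output into a density via the formula above, where `d_s` is assumed to be the classifier's probability that its input is the exemplar $s$ itself:

```python
def exemplar_density(d_s: float) -> float:
    """p(s) = (1 - D_s(s)) / D_s(s)."""
    return (1.0 - d_s) / d_s

# Easily distinguished (novel) state: D_s(s) -> 1, density -> 0.
print(exemplar_density(0.99))
# Indistinguishable from past data: D_s(s) = 0.5, density = 1.
print(exemplar_density(0.5))
```

A low density feeds back into a large pseudo-count bonus, so states the classifier separates easily are treated as novel.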

Error

If we have some target function $f^*(s, a)$ and a fitted function $\hat{f}_\theta(s, a)$ trained on samples from past experience, $\hat{f}_\theta$ would be good at estimating $f^*$ at the samples but worse in far-away regions. Thus, we can measure the novelty of a state as the error

$$\mathcal{E}(s, a) = \lVert \hat{f}_\theta(s, a) - f^*(s, a) \rVert^2,$$

with $\hat{f}_\theta$ trained on past states. $f^*$ can be any function, but past implementations used the next state $f^*(s, a) = s'$ or $f^*(s, a) = \phi(s, a)$ where $\phi$ is a completely random network.
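A hedged sketch of the random-target variant: $f^*$ is a frozen random nonlinear map, and $\hat{f}_\theta$ is a linear model fitted by least squares on past states. The shapes, the linear fit, and the tanh target are illustrative assumptions, not the published architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, feat_dim = 4, 8
target_w = rng.normal(size=(state_dim, feat_dim))

def f_star(states: np.ndarray) -> np.ndarray:
    """Frozen random target network (never trained)."""
    return np.tanh(states @ target_w)

# "Train" the predictor on past states via least squares.
past_states = rng.normal(size=(256, state_dim))
fitted_w, *_ = np.linalg.lstsq(past_states, f_star(past_states),
                               rcond=None)

def novelty(state: np.ndarray) -> float:
    """Squared prediction error: small near training data,
    large for out-of-distribution states."""
    return float(np.sum((state @ fitted_w - f_star(state)) ** 2))

seen = past_states[0]                         # in-distribution
unseen = 100.0 * rng.normal(size=state_dim)   # far from training data
print(novelty(seen) < novelty(unseen))        # True
```

The bonus requires no density model at all: the prediction error itself acts as the novelty signal, which is what makes this family of methods cheap to bolt onto existing agents.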