Optimistic exploration augments the standard reward function by adding an exploration bonus,
$$r^+(s, a) = r(s, a) + \mathcal{B}(N(s)),$$
where $N(s)$ is a "count" for state $s$. As $N(s)$ increases, we design $\mathcal{B}$ to decrease. Many designs for $\mathcal{B}$ exist, but some examples are below, followed by a short code sketch:
UCB: $\mathcal{B}(N(s)) = \sqrt{\frac{2 \ln n}{N(s)}}$.
MBIE-EB: $\mathcal{B}(N(s)) = \sqrt{\frac{1}{N(s)}}$.
BEB: $\mathcal{B}(N(s)) = \frac{1}{N(s)}$.
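As a concrete illustration, here is a minimal sketch of adding such a bonus to the reward, assuming a tabular `counts` dictionary and a bonus weight `beta` (both are illustrative choices, not part of the formulas above):

```python
import math
from collections import defaultdict

# Hypothetical tabular counter; in practice N(s) is a pseudo-count (see below).
counts = defaultdict(int)


def bonus(n_s, n_total, kind="MBIE-EB"):
    """Exploration bonus B(N(s)): decreases as the count N(s) grows."""
    if kind == "UCB":
        return math.sqrt(2.0 * math.log(n_total) / n_s)
    if kind == "MBIE-EB":
        return math.sqrt(1.0 / n_s)
    if kind == "BEB":
        return 1.0 / n_s
    raise ValueError(f"unknown bonus: {kind}")


def augmented_reward(r, s, n_total, beta=0.1, kind="MBIE-EB"):
    """r^+(s, a) = r(s, a) + B(N(s)), scaled by an assumed weight beta."""
    counts[s] += 1  # register the visit so N(s) >= 1
    return r + beta * bonus(counts[s], n_total, kind)
```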
However, we don't literally count $N(s)$ because in complex environments, we almost never visit the same state twice. Thus, we employ some similarity measure, so $N(s)$ measures how similar $s$ is to our past states. There are a variety of methods to estimate this pseudo-count, as described below.
We can fit a density model $p_\theta(s)$ on our past experience $\mathcal{D}$. The idea is that if a state $s$ is similar to what we saw before, $p_\theta(s)$ would be high.
In the simple case where we measure true counts, our density would be exactly
$$p(s) = \frac{N(s)}{n}, \qquad p'(s) = \frac{N(s) + 1}{n + 1},$$
where $n$ is the total number of states visited and $p'$ is the density after one more visit to $s$. This is a system of equations we can use to solve for our pseudo-counts $\hat{N}(s)$ and $\hat{n}$, so if we have the density model $p_\theta(s)$ and the updated density $p_{\theta'}(s)$ trained on past experience plus $s$,
$$p_\theta(s) = \frac{\hat{N}(s)}{\hat{n}}, \qquad p_{\theta'}(s) = \frac{\hat{N}(s) + 1}{\hat{n} + 1} \quad\Longrightarrow\quad \hat{N}(s) = \hat{n}\, p_\theta(s) = \frac{p_\theta(s)\,\bigl(1 - p_{\theta'}(s)\bigr)}{p_{\theta'}(s) - p_\theta(s)}.$$
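A small sketch of this computation, assuming we can evaluate the density model on $s$ before and after updating it on $s$ (the function and variable names here are hypothetical):

```python
def pseudo_count(p_s, p_prime_s):
    """Solve p = N/n and p' = (N+1)/(n+1) for the pseudo-count N-hat.

    p_s       : density of s under the model fit on past experience D
    p_prime_s : density of s after also training on s (i.e., on D plus {s})
    """
    # N-hat = p(s) * (1 - p'(s)) / (p'(s) - p(s))
    return p_s * (1.0 - p_prime_s) / (p_prime_s - p_s)


# Example: p(s) = 0.01 and p'(s) = 0.012 give a pseudo-count of about 4.9,
# i.e., s looks roughly as familiar as a state visited five times.
```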
Alternatively, we can keep actual counts, but over a hashed version of the state $\phi(s)$ rather than the raw state. If our encoding method maps similar states to the same hash, then we effectively achieve the original goal. Empirical implementations use an 🧬 Autoencoder trained to reconstruct the states, and the hash is the bottleneck.
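A sketch of counting hashed states, here using a SimHash-style random projection as a stand-in for the learned autoencoder bottleneck (the projection, code length, and state dimension are assumptions):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
STATE_DIM, CODE_BITS = 64, 16                     # illustrative sizes
A = rng.standard_normal((CODE_BITS, STATE_DIM))   # fixed random projection
hash_counts = defaultdict(int)


def state_hash(s):
    """Map a state vector to a short binary code; nearby states tend to collide."""
    return np.packbits(A @ s > 0).tobytes()


def hashed_count(s):
    """Count visits to the hash bucket phi(s) instead of the raw state s."""
    key = state_hash(s)
    hash_counts[key] += 1
    return hash_counts[key]
```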
Exemplar modeling implicitly captures density by comparing $s$ with all past states. Intuitively, the state is novel if it's easy to distinguish from past states. Thus, if we have a classifier $D_s$ trained with positive class $\{s\}$ and everything else $\mathcal{D}$ in the negative class, the density is
$$p_\theta(s) = \frac{1 - D_s(s)}{D_s(s)}.$$
Note that if we visited $s$ before, it would also appear in the negative class. Moreover, if we include regularization, $D_s(s)$ serves as a good measure for the uniqueness of $s$.
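A sketch of the exemplar idea, assuming a logistic-regression discriminator over raw state vectors (the classifier choice and regularization strength are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def exemplar_density(s, past_states):
    """Estimate p(s) from a discriminator D_s with positive class {s}
    and the past states D as negatives: p(s) = (1 - D_s(s)) / D_s(s)."""
    X = np.vstack([s[None, :], past_states])
    y = np.concatenate([[1.0], np.zeros(len(past_states))])
    # L2 regularization keeps D_s(s) below 1 even when s is trivially separable.
    clf = LogisticRegression(C=1.0).fit(X, y)
    d_s = clf.predict_proba(s[None, :])[0, 1]  # D_s(s): probability s is the exemplar
    return (1.0 - d_s) / d_s
```

A novel state is easy to separate, so $D_s(s)$ is close to 1 and the implied density is close to 0, which yields a large exploration bonus.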
If we have some target function $f^\star(s, a)$ and a fitted function $\hat{f}_\theta(s, a)$ that uses samples from $\mathcal{D}$, $\hat{f}_\theta$ would be good at estimating $f^\star$ at the samples but worse in far-away regions. Thus, we can measure the novelty of a state as the error
$$\mathcal{E}(s, a) = \bigl\lVert \hat{f}_\theta(s, a) - f^\star(s, a) \bigr\rVert^2,$$
with $\hat{f}_\theta$ trained on past states. $f^\star$ can be any function, but past implementations used $f^\star(s, a) = s'$ (predicting the next state) or $f^\star(s, a) = f_\phi(s, a)$ where $\phi$ is completely random.
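A sketch of the random-target variant (random network distillation), here taking only the state as input for simplicity; the network sizes and optimizer are assumptions:

```python
import torch
import torch.nn as nn

STATE_DIM, FEAT_DIM = 64, 32  # illustrative sizes

# f*(s): a fixed, randomly initialized target network that is never trained.
target = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(), nn.Linear(128, FEAT_DIM))
for p in target.parameters():
    p.requires_grad_(False)

# f-hat(s): the predictor, trained to match f* on states we have visited.
predictor = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(), nn.Linear(128, FEAT_DIM))
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)


def novelty_bonus(s):
    """E(s) = ||f-hat(s) - f*(s)||^2: large for states unlike past training data."""
    error = (predictor(s) - target(s)).pow(2).sum(dim=-1)
    # Train the predictor on the visited states so the error shrinks with familiarity.
    optimizer.zero_grad()
    error.mean().backward()
    optimizer.step()
    return error.detach()
```

A deterministic random target has the appealing property that the error reflects only how familiar $s$ is, not environment noise, which is one motivation for choosing it over next-state prediction.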