In the multi-armed bandit setting, Thompson sampling estimates $\hat{p}(\theta_1, \dots, \theta_n)$, the distribution over the arms' reward parameters. Generalizing this to the MDP setting, we instead estimate a distribution over Q-functions, $\hat{p}(Q)$. Thus, our algorithm is as follows (a short code sketch appears after the list):

  1. Sample a Q-function $Q \sim \hat{p}(Q)$.
  2. Act according to $Q$ for one episode.
  3. Update $\hat{p}(Q)$.
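
As a rough illustration, here is a minimal sketch of this loop in Python. The `posterior` object (with `sample` and `update` methods), the `greedy_action` method, and the gym-style `env` interface are placeholders introduced for this example, not names from the original algorithm statement.

```python
def run_episode(env, q):
    """Collect one episode, acting greedily with respect to the sampled Q-function."""
    transitions = []
    state, done = env.reset(), False
    while not done:
        action = q.greedy_action(state)
        next_state, reward, done = env.step(action)
        transitions.append((state, action, reward, next_state, done))
        state = next_state
    return transitions


def thompson_sampling_rl(posterior, env, num_episodes):
    for _ in range(num_episodes):
        q = posterior.sample()       # 1. sample a Q-function from \hat{p}(Q)
        data = run_episode(env, q)   # 2. act according to it for one episode
        posterior.update(data)       # 3. update \hat{p}(Q) with the new data
```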

To represent $\hat{p}(Q)$, we can use bootstrapping: train multiple Q-functions, each on a (possibly limited) resample of the past data, and implement the sampling step by randomly picking one of these functions.
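
To make the bootstrap idea concrete, below is a small, self-contained tabular example; the chain environment, ensemble size, and all hyperparameters are invented for illustration. It uses an online, mask-based variant of the bootstrap: each transition updates each Q-table only with some probability, and each episode is driven by one randomly chosen table.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 6, 2   # tiny "chain" MDP: action 0 moves left, action 1 moves right
K = 5                        # number of Q-functions in the bootstrapped ensemble
ALPHA, GAMMA, MASK_P = 0.5, 0.99, 0.5

# Small random initialization so the ensemble members disagree initially,
# which is what produces different exploratory behavior from episode to episode.
q_ensemble = rng.normal(0.0, 0.1, size=(K, N_STATES, N_ACTIONS))


def step(state, action):
    """Hand-rolled chain dynamics: reward 1 only for reaching the last state."""
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done


for episode in range(200):
    k = rng.integers(K)      # sample one Q-function and commit to it for this episode
    state, done, t = 0, False, 0
    while not done and t < 20:
        action = int(np.argmax(q_ensemble[k, state]))  # act greedily w.r.t. the sampled head
        next_state, reward, done = step(state, action)
        # Bootstrap-style mask: each head sees this transition only with probability MASK_P.
        for i in range(K):
            if rng.random() < MASK_P:
                target = reward + (0.0 if done else GAMMA * q_ensemble[i, next_state].max())
                q_ensemble[i, state, action] += ALPHA * (target - q_ensemble[i, state, action])
        state, t = next_state, t + 1

print(q_ensemble.mean(axis=0))   # value estimates averaged across the ensemble
```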

Intuitively, Thompson sampling explores by acting according to randomly sampled Q-functions. The agent commits to the internally consistent policy defined by the sampled function for a whole episode, and the hope is that it will eventually stumble upon something good.