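In the multi-armed bandit setting, Thompson sampling estimates a posterior distribution over the reward parameters of the arms, samples from it, and acts greedily with respect to the sample. The same idea can be applied to a distribution $p(Q)$ over Q-functions: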
- Sample a Q-function $Q$ from the current distribution $p(Q)$ over Q-functions.
- Act according to $Q$ for one episode.
- Update $p(Q)$.
To represent
Intuitively, Thompson sampling explores by acting on randomly sampled Q-functions. For the duration of an episode, the agent commits to the single consistent policy defined by the sampled function, and our hope is that it will eventually stumble upon something good.
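
The sample, act, update loop described above can be sketched in a tabular toy setting. Everything concrete here is an assumption made for illustration: the chain environment (`step`), the constants, and in particular the use of a small ensemble of randomly initialized Q-tables, each trained on a random half of the observed transitions, as a stand-in for the distribution over Q-functions. It is one possible representation, not necessarily the one these notes have in mind.

```python
import numpy as np

# Hypothetical toy problem: a chain of 5 states, action 1 moves right, action 0 left.
N_STATES, N_ACTIONS, HORIZON = 5, 2, 20
GAMMA, ALPHA = 0.9, 0.5


def step(state, action):
    """Chain MDP (illustrative): reward 1 only on reaching the last state."""
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    done = next_state == N_STATES - 1
    return next_state, float(done), done


def q_update(q, s, a, r, s_next, done):
    """One tabular Q-learning update, applied in place to the table q."""
    target = r if done else r + GAMMA * q[s_next].max()
    q[s, a] += ALPHA * (target - q[s, a])


def thompson_q_learning(n_members=10, n_episodes=200, seed=0):
    rng = np.random.default_rng(seed)
    # Assumption: the distribution over Q-functions is approximated by an ensemble.
    ensemble = rng.uniform(0.0, 1.0, size=(n_members, N_STATES, N_ACTIONS))
    returns = []
    for _ in range(n_episodes):
        # 1. Sample a Q-function (here: pick one ensemble member uniformly).
        q = ensemble[rng.integers(n_members)]
        state, total, transitions = 0, 0.0, []
        for _ in range(HORIZON):
            # 2. Act greedily w.r.t. the sampled Q for the whole episode.
            action = int(np.argmax(q[state]))
            next_state, reward, done = step(state, action)
            transitions.append((state, action, reward, next_state, done))
            state, total = next_state, total + reward
            if done:
                break
        # 3. Update the distribution: each member sees a random half of the data.
        for member in ensemble:
            for transition in transitions:
                if rng.random() < 0.5:
                    q_update(member, *transition)
        returns.append(total)
    return ensemble, returns
```

Calling `thompson_q_learning()` returns the ensemble and the list of per-episode returns. Updates are deliberately deferred until the episode ends, so the policy being followed stays fixed within an episode, which is the "commits to the consistent policy" behavior described above.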