Experience replay is a technique used in โ™Ÿ๏ธ Reinforcement Learning, usually with ๐Ÿš€ Q-Learning.

One problem with the standard Q-learning algorithm is that consecutive transition tuples $(s, a, r, s')$ are heavily correlated with each other since we collect them from the same trajectory. If we train our network on them in this order, we overfit to our current trajectory.

The solution to this is a replay buffer $\mathcal{D}$ that stores past transitions. When training, instead of using the most recent tuple, we randomly sample a batch from $\mathcal{D}$; we also periodically update the buffer with our most recent data. This disrupts the correlation, reducing variance and allowing our Q-function to generalize better. Moreover, it improves sample efficiency, since each transition can be used for multiple updates.
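
As a concrete illustration, here is a minimal sketch of a uniform replay buffer in Python; the class and method names are my own and not tied to any particular library:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of past transitions (s, a, r, s', done)."""

    def __init__(self, capacity=100_000):
        # Oldest transitions are evicted automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Called after every environment step to keep the buffer fresh.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive transitions from the same trajectory.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```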

Prioritized Experience Replay

Prioritized Experience Replay1 notes that in the original formulation, all transitions are equally likely to be sampled, regardless of their โ€œusefulnessโ€ or โ€œimportance.โ€ It would be more efficient to instead sample the important transitions more frequently.

One straightforward notion of “importance” is the transition’s TD error (from ⌛️ Temporal Difference Learning); the higher the error, the more crucial its update will be. Thus, we prioritize experience replay’s stochastic sampling by assigning higher probability to those transitions. For transition $i$, the probability we select it is

$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha},$$

where $\alpha$ controls the degree of prioritization ($\alpha = 0$ makes it uniform) and $p_i$ is the priority, defined as either

$$p_i = |\delta_i| + \epsilon \qquad \text{or} \qquad p_i = \frac{1}{\text{rank}(i)},$$

where $\delta_i$ is the TD error, $\epsilon$ is a small positive constant, and $\text{rank}(i)$ is the rank of transition $i$ when the buffer is sorted by $|\delta_i|$.
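
A small sketch of the proportional variant, turning TD errors into sampling probabilities; the function name is my own, and $\alpha = 0.6$ is the value reported in the paper for this variant:

```python
import numpy as np

def sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Proportional priorities p_i = |delta_i| + eps, turned into
    probabilities P(i) = p_i^alpha / sum_k p_k^alpha."""
    priorities = np.abs(td_errors) + eps
    scaled = priorities ** alpha      # alpha = 0 recovers uniform sampling
    return scaled / scaled.sum()

# Example: transitions with larger TD error are sampled more often.
td_errors = np.array([0.1, 2.0, 0.5, 0.05])
probs = sampling_probabilities(td_errors)
batch_idx = np.random.choice(len(td_errors), size=2, p=probs)
```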

However, since our ultimate goal in sampling is to estimate an expectation, we need to avoid biasing our prioritized estimate; we correct for this with 🪆 Importance Sampling weights

$$w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^{\beta},$$

where $N$ is the number of transitions in the buffer and $\beta$ controls how strongly the weights correct the bias; $\beta$ can be annealed from an initial $\beta_0$ to $1$ throughout training. In our weight update, we use $w_i \delta_i$ instead of just $\delta_i$.
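
A sketch of the correction step under the same assumptions as above (function names are my own); normalizing by $\max_i w_i$ and annealing from $\beta_0 = 0.4$ follow the paper’s proportional-variant setup:

```python
import numpy as np

def importance_weights(probs, batch_idx, beta):
    """w_i = (1/N * 1/P(i))^beta for the sampled indices, normalized
    by max(w) to keep update magnitudes bounded."""
    N = len(probs)
    w = (N * probs[batch_idx]) ** (-beta)
    return w / w.max()

def beta_schedule(step, total_steps, beta_0=0.4):
    # Anneal beta linearly from beta_0 toward 1 over the course of training.
    return min(1.0, beta_0 + (1.0 - beta_0) * step / total_steps)

# Example: weights for a batch drawn with the probabilities computed earlier.
probs = np.array([0.05, 0.60, 0.25, 0.10])
batch_idx = np.array([1, 2])
beta = beta_schedule(step=1_000, total_steps=10_000)
w = importance_weights(probs, batch_idx, beta)
# The TD update then scales each transition's error: w * delta, not delta alone.
```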

Footnotes

  1. Prioritized Experience Replay (Schaul et al., 2016)