Offline reinforcement learning aims to train generalizable policies by reusing previously collected datasets. While most classic reinforcement learning methods learn by interacting with the world (online data collection), offline methods don't collect new data at all: they only use the dataset they're given, much like Supervised Learning.
Formally, we are given a dataset $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$ of transitions generated from some unknown behavior policy $\pi_\beta$, and the goal is to learn a policy $\pi$ that achieves the highest possible return while training on $\mathcal{D}$ alone.
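Spelled out in the usual notation (the symbols here are mine; the post's exact formula may differ), the objective is the standard expected discounted return, optimized without ever querying the environment for new transitions:

$$
\max_{\pi}\; J(\pi) \;=\; \mathbb{E}_{\tau \sim p_{\pi}(\tau)}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]
\qquad \text{(learned using only } \mathcal{D}\text{, with no further interaction).}
$$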
Distribution Shift
Theoretically, any off-policy algorithm can work for offline RL: simply treat the fixed dataset as a replay buffer that never grows. However, this doesn't work well in practice due to the core problem of offline learning: distribution shift.
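To make that naive recipe concrete, here is a minimal tabular Q-learning loop written for this note (the toy dynamics, rewards, and hyperparameters are all made up for illustration). The point is structural: every update comes from the pre-collected buffer, and the environment is never stepped.

```python
# Minimal sketch: off-policy (Q-learning) updates driven entirely by a static
# dataset. No env.step() call appears anywhere.
import numpy as np

n_states, n_actions, gamma, lr = 5, 3, 0.99, 0.1

# Pretend this buffer was logged earlier by some unknown behavior policy.
rng = np.random.default_rng(0)
dataset = [
    (s, a, float(rng.normal(loc=(a == 0))), (s + 1) % n_states)  # (s, a, r, s')
    for s in range(n_states)
    for a in range(n_actions)
    for _ in range(20)
]

Q = np.zeros((n_states, n_actions))
for _ in range(200):                      # epochs over the fixed dataset
    for s, a, r, s_next in dataset:
        # Standard Bellman backup; note the max ranges over *all* actions in
        # s_next, including ones the behavior policy may never have taken.
        td_target = r + gamma * Q[s_next].max()
        Q[s, a] += lr * (td_target - Q[s, a])

greedy_policy = Q.argmax(axis=1)          # the learned offline policy
```

In this tiny example every state-action pair actually appears in the data, so nothing goes wrong; the next paragraph is about what happens when that stops being true.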
Simply put, some actions that are possible in the environment won't be present in our dataset. Whereas an online method could simply try such an action to see whether it's good or bad, an offline algorithm can't; in other words, counterfactual queries are possible in the online setting but impossible offline. Yet our goal is still to train a policy that performs better than the one that collected the dataset. Thus, unlike the standard supervised setting, where the data is independent and identically distributed and we only need to perform well in distribution, here we want to learn a policy that deliberately induces a different, and better, state-action distribution than the data it was trained on.
Specifically, if we take the greedy policy $\pi(s) = \arg\max_{a} \hat{Q}(s, a)$ that maximizes a Q-function $\hat{Q}$ trained only on data from $\pi_\beta$, then we're searching for the best action under estimates from the wrong distribution: for actions far from what $\pi_\beta$ would take, $\hat{Q}$ is mostly guesswork, and the $\max$ is drawn precisely to the actions where that guess is erroneously high. In effect, we end up maximizing the estimation error rather than the true value.
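To see the effect numerically, here is a small self-contained experiment (my own toy example, not from the post, using noisy Monte Carlo value estimates in place of a learned Q-function): actions the behavior policy rarely took get the noisiest estimates, and taking a max over all estimates is drawn to exactly those actions.

```python
# Toy numerical sketch (mine, not the post's): acting greedily w.r.t. erroneous
# estimates tends to select the actions whose estimates are most wrong.
import numpy as np

rng = np.random.default_rng(0)
true_q = np.linspace(1.0, 0.1, 10)            # action 0 is truly the best
# The behavior policy mostly chose good actions, so poor actions appear
# only a handful of times in the dataset and their estimates are noisy.
samples_per_action = np.array([200, 150, 100, 50, 20, 10, 5, 3, 2, 1])

gaps = []
for _ in range(2000):
    # Estimate each action's value from its logged returns (noise std = 1).
    q_hat = np.array([
        rng.normal(loc=q, scale=1.0, size=n).mean()
        for q, n in zip(true_q, samples_per_action)
    ])
    greedy = int(np.argmax(q_hat))            # max over *all* actions
    gaps.append(true_q[0] - true_q[greedy])   # true value lost by acting greedily

gaps = np.array(gaps)
print(f"greedy action is suboptimal in {np.mean(gaps > 0):.0%} of runs")
print(f"average true-value gap: {gaps.mean():.2f}")
```

With a learned Q-function the error on out-of-distribution actions comes from extrapolation rather than sampling noise, but the failure mode of the max is the same.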