Offline reinforcement learning aims to train generalizable policies by reusing previously collected datasets. While most classic reinforcement learning methods learn by interacting with the world (online data collection), offline methods don't collect new data at all: they only use the dataset they're given, much like Supervised Learning.
Formally, we are given a dataset $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$ of transitions generated from some unknown behavior policy $\pi_\beta$, and the goal is to learn a policy $\pi$ that achieves the highest possible return while training on $\mathcal{D}$ alone.
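Spelled out in the usual notation (the symbols here are mine; the post's exact formula may differ), the objective is the standard expected discounted return, optimized without ever querying the environment for new transitions:

$$
\max_{\pi}\; J(\pi) \;=\; \mathbb{E}_{\tau \sim p_{\pi}(\tau)}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]
\qquad \text{(learned using only } \mathcal{D}\text{, with no further interaction).}
$$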
Distribution Shift
Theoretically, any off-policy algorithm can work for offline RL: simply treat the fixed dataset as a replay buffer that never grows. However, this doesn't work well in practice due to the core problem of offline learning: distribution shift.
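To make that naive recipe concrete, here is a minimal tabular Q-learning loop written for this note (the toy dynamics, rewards, and hyperparameters are all made up for illustration). The point is structural: every update comes from the pre-collected buffer, and the environment is never stepped.

```python
# Minimal sketch: off-policy (Q-learning) updates driven entirely by a static
# dataset. No env.step() call appears anywhere.
import numpy as np

n_states, n_actions, gamma, lr = 5, 3, 0.99, 0.1

# Pretend this buffer was logged earlier by some unknown behavior policy.
rng = np.random.default_rng(0)
dataset = [
    (s, a, float(rng.normal(loc=(a == 0))), (s + 1) % n_states)  # (s, a, r, s')
    for s in range(n_states)
    for a in range(n_actions)
    for _ in range(20)
]

Q = np.zeros((n_states, n_actions))
for _ in range(200):                      # epochs over the fixed dataset
    for s, a, r, s_next in dataset:
        # Standard Bellman backup; note the max ranges over *all* actions in
        # s_next, including ones the behavior policy may never have taken.
        td_target = r + gamma * Q[s_next].max()
        Q[s, a] += lr * (td_target - Q[s, a])

greedy_policy = Q.argmax(axis=1)          # the learned offline policy
```

In this tiny example every state-action pair actually appears in the data, so nothing goes wrong; the next paragraph is about what happens when that stops being true.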
Simply put, some actions that are possible in the environment won't be present in our dataset. Whereas an online method could simply try such an action to see whether it's good or bad, an offline algorithm can't; in other words, counterfactual queries are possible in the online setting but impossible offline. Yet our goal is still to train a policy that performs better than the one that collected the dataset. Thus, unlike the standard supervised setting, where the data is independent and identically distributed and we only need to perform well in distribution, here we want to learn a policy that deliberately induces a different, and better, state-action distribution than the data it was trained on.
Specifically, if we take the greedy policy $\pi(s) = \arg\max_{a} \hat{Q}(s, a)$ that maximizes a Q-function $\hat{Q}$ trained only on data from $\pi_\beta$, then we're searching for the best action under estimates from the wrong distribution: for actions far from what $\pi_\beta$ would take, $\hat{Q}$ is mostly guesswork, and the $\max$ is drawn precisely to the actions where that guess is erroneously high. In effect, we end up maximizing the estimation error rather than the true value.
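To see the effect numerically, here is a small self-contained experiment (my own toy example, not from the post, using noisy Monte Carlo value estimates in place of a learned Q-function): actions the behavior policy rarely took get the noisiest estimates, and taking a max over all estimates is drawn to exactly those actions.

```python
# Toy numerical sketch (mine, not the post's): acting greedily w.r.t. erroneous
# estimates tends to select the actions whose estimates are most wrong.
import numpy as np

rng = np.random.default_rng(0)
true_q = np.linspace(1.0, 0.1, 10)            # action 0 is truly the best
# The behavior policy mostly chose good actions, so poor actions appear
# only a handful of times in the dataset and their estimates are noisy.
samples_per_action = np.array([200, 150, 100, 50, 20, 10, 5, 3, 2, 1])

gaps = []
for _ in range(2000):
    # Estimate each action's value from its logged returns (noise std = 1).
    q_hat = np.array([
        rng.normal(loc=q, scale=1.0, size=n).mean()
        for q, n in zip(true_q, samples_per_action)
    ])
    greedy = int(np.argmax(q_hat))            # max over *all* actions
    gaps.append(true_q[0] - true_q[greedy])   # true value lost by acting greedily

gaps = np.array(gaps)
print(f"greedy action is suboptimal in {np.mean(gaps > 0):.0%} of runs")
print(f"average true-value gap: {gaps.mean():.2f}")
```

With a learned Q-function the error on out-of-distribution actions comes from extrapolation rather than sampling noise, but the failure mode of the max is the same.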