Offline reinforcement learning aims to train generalizable models by reusing previously collected datasets. While most classic reinforcement learning methods learn by interacting with the world (online data collection), offline methods don't collect new data at all: they only use the dataset they're given, much like 🎓 Supervised Learning.

Formally, given a dataset

$$\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}$$

generated from some unknown behavior policy $\pi_\beta$, with $s \sim d^{\pi_\beta}(s)$, $a \sim \pi_\beta(a \mid s)$, $s' \sim p(s' \mid s, a)$, and $r = r(s, a)$, our goal is to learn the best possible policy $\pi^*$.
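
In code, such a dataset is just a fixed buffer of logged transitions that training samples from, with no environment interaction. Below is a minimal sketch; the `OfflineDataset` class and its methods are illustrative, not from any particular library.

```python
import numpy as np

class OfflineDataset:
    """A fixed buffer of logged transitions (s, a, r, s').

    Training only samples from this buffer; the environment is never
    queried again. (Class and method names are illustrative.)
    """

    def __init__(self, states, actions, rewards, next_states):
        self.states = np.asarray(states)
        self.actions = np.asarray(actions)
        self.rewards = np.asarray(rewards)
        self.next_states = np.asarray(next_states)

    def sample(self, batch_size, rng=None):
        # Uniformly sample a minibatch of logged transitions.
        rng = rng or np.random.default_rng()
        idx = rng.integers(0, len(self.states), size=batch_size)
        return (self.states[idx], self.actions[idx],
                self.rewards[idx], self.next_states[idx])
```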

Distribution Shift

Theoretically, any off-policy algorithm can work for offline RL. However, this doesn't work well in practice due to the core problem of offline learning: distribution shift.

Simply put, some actions possible in the environment won't be present in our dataset. Whereas online methods can simply try an action out to see if it's good or bad, offline algorithms can't; in other words, counterfactual queries are possible in the online setting but impossible offline. However, our goal is still to train a policy that performs better than the one that collected the dataset. Thus, unlike standard supervised learning, where the data is independent and identically distributed and the algorithm only needs to perform well in distribution, our goal is to learn a policy that induces a different, and hopefully better, distribution than the data.
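
As a toy illustration of the missing-action problem (tabular setting, made-up numbers): if a state only ever appears with action 0 in the dataset, nothing constrains the learned value of action 1 there, yet a greedy policy can still prefer it.

```python
import numpy as np

# Toy tabular setting: one state, two actions. The logged dataset only
# contains action 0, whose true reward is 0.1; action 1 was never tried
# by the behavior policy.
dataset = [(0, 0, 0.1)]            # (state, action, reward)

# Arbitrary initialization: Q[0, 1] = 0.7 stands in for whatever
# unconstrained value a learner might assign to an action it never sees.
Q = np.array([[0.0, 0.7]])

# "Training" only ever touches Q[s, a] for pairs present in the data,
# so Q[0, 1] keeps its arbitrary value.
for _ in range(100):
    for s, a, r in dataset:
        Q[s, a] += 0.5 * (r - Q[s, a])

print(Q[0])                # [0.1, 0.7]: a fitted value next to pure guesswork
print(int(Q[0].argmax()))  # greedy action is 1, the action never in the data
```

An online agent would eventually take action 1, observe its true reward, and correct the table; an offline agent is stuck with the guess.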

Specifically, if we have the policy

$$\pi_{\text{new}} = \arg\max_{\pi} \; \mathbb{E}_{a \sim \pi(a \mid s)}\left[Q(s, a)\right]$$

that maximizes the Q-function trained with

$$\min_{Q} \; \mathbb{E}_{(s, a, s') \sim \mathcal{D}}\left[\left(Q(s, a) - \left(r(s, a) + \gamma \, \mathbb{E}_{a' \sim \pi_{\text{new}}(a' \mid s')}\left[Q(s', a')\right]\right)\right)^2\right]$$

then we're picking the best policy under the wrong distribution: $Q$ is only reliable on actions that appear in $\mathcal{D}$, so the $\arg\max$ is drawn to out-of-distribution actions whose values are erroneously high, effectively maximizing the error.
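
The sketch below shows where that error exploitation enters a tabular offline Q-learning step (hypothetical helper, not from a library): for the greedy $\pi_{\text{new}}$, the inner expectation reduces to a max over all actions at $s'$, including ones the dataset never covers.

```python
import numpy as np

def offline_q_update(Q, batch, gamma=0.99, lr=0.1):
    """One tabular Q-learning step over logged transitions (illustrative sketch).

    For a greedy pi_new, the bootstrap target maximizes over *all* actions
    at s', including actions absent from the dataset, so erroneously high
    values there are copied into the targets instead of being corrected.
    """
    for s, a, r, s_next in batch:
        target = r + gamma * Q[s_next].max()  # max can land on an unseen action
        Q[s, a] += lr * (target - Q[s, a])
    return Q

# Tiny usage example with made-up numbers: two states, two actions.
Q = np.array([[0.0, 5.0],   # Q[0, 1] = 5.0 is an unchecked overestimate
              [0.0, 0.0]])
batch = [(1, 0, 0.0, 0)]    # one logged transition ending in state 0
offline_q_update(Q, batch)
print(Q[1, 0])              # pulled up toward gamma * 5.0 by the bogus entry
```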