Conservative Q-learning (CQL) is a framework that modifies the standard Q-learning objective to perform better in Offline Reinforcement Learning settings, which prevent active exploration. We can mitigate the action distribution shift problem by learning conservative action-value estimates that lower bound the true values, thus preventing our policy from exploiting inaccuracies in low-data regions of the buffer.
More concretely, the key idea behind CQL is to minimize Q-values under some distribution $\mu(a \mid s)$, in addition to the standard Bellman error objective:

$$\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\;\; \alpha\, \mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(a \mid s)}\big[Q(s, a)\big] + \frac{1}{2}\, \mathbb{E}_{s, a \sim \mathcal{D}}\Big[\big(Q(s, a) - \hat{\mathcal{B}}^{\pi}\hat{Q}^{k}(s, a)\big)^{2}\Big].$$
First, since we're only interested in overestimation for unseen actions, we can restrict the penalty to out-of-distribution actions by additionally maximizing Q-values under the data distribution $\hat{\pi}_{\beta}(a \mid s)$ (the behavior policy that generated the buffer):

$$\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\;\; \alpha\,\Big(\mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(a \mid s)}\big[Q(s, a)\big] - \mathbb{E}_{s \sim \mathcal{D},\, a \sim \hat{\pi}_{\beta}(a \mid s)}\big[Q(s, a)\big]\Big) + \frac{1}{2}\, \mathbb{E}_{s, a \sim \mathcal{D}}\Big[\big(Q(s, a) - \hat{\mathcal{B}}^{\pi}\hat{Q}^{k}(s, a)\big)^{2}\Big],$$
where the target is given by the empirical Bellman backup

$$\hat{\mathcal{B}}^{\pi}\hat{Q}^{k}(s, a) = r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi(a' \mid s')}\big[\hat{Q}^{k}(s', a')\big].$$
This update step will lower bound our action-value in expectation under the policy for every state in the buffer, i.e. $\hat{V}^{\pi}(s) \le V^{\pi}(s)$ for all $s \in \mathcal{D}$ when we set $\mu = \pi$ and $\alpha$ is large enough, where $V^{\pi}(s) = \mathbb{E}_{a \sim \pi(a \mid s)}\big[Q^{\pi}(s, a)\big]$. (Without the maximization term under the data distribution, the bound would instead hold point-wise for every $(s, a)$, but it would be unnecessarily conservative.)
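To make this evaluation step concrete, here is a minimal sketch of the penalized objective for one sampled batch, assuming a continuous-action critic. The names (`q_net`, `q_target`, `policy`, the batch keys) are hypothetical, the expectation over $\mu = \pi$ is approximated with a single action sample, and the target uses a separate target network as is standard in practice.

```python
import torch
import torch.nn.functional as F

def conservative_evaluation_loss(q_net, q_target, policy, batch, alpha=1.0, gamma=0.99):
    """One-sample estimate of the conservative policy-evaluation objective."""
    s, a, r, s2, done = batch["s"], batch["a"], batch["r"], batch["s2"], batch["done"]

    # Empirical Bellman target: r + gamma * E_{a'~pi}[Q_hat(s', a')] (single sample).
    with torch.no_grad():
        a2 = policy(s2)                                  # a' ~ pi(. | s')
        target = r + gamma * (1.0 - done) * q_target(s2, a2)

    q_data = q_net(s, a)                                 # Q(s, a) on buffer actions
    td_loss = 0.5 * F.mse_loss(q_data, target)           # (1/2) E_D[(Q - B^pi Q_hat)^2]

    # Conservative penalty: push Q down under mu = pi, up under the data distribution.
    a_mu = policy(s)                                     # a ~ mu(. | s) = pi(. | s)
    penalty = (q_net(s, a_mu) - q_data).mean()           # E_mu[Q] - E_D[Q]

    return alpha * penalty + td_loss
```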
In practice, running policy iteration with this modified evaluation objective is computationally expensive. We can instead merge the policy-evaluation and policy-improvement steps by choosing $\mu$ to be the distribution that maximizes the current Q-function iterate (subject to a regularizer $\mathcal{R}(\mu)$), giving us the CQL($\mathcal{R}$) family of optimization problems:

$$\min_{Q}\,\max_{\mu}\;\; \alpha\,\Big(\mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(a \mid s)}\big[Q(s, a)\big] - \mathbb{E}_{s \sim \mathcal{D},\, a \sim \hat{\pi}_{\beta}(a \mid s)}\big[Q(s, a)\big]\Big) + \frac{1}{2}\, \mathbb{E}_{s, a, s' \sim \mathcal{D}}\Big[\big(Q(s, a) - \hat{\mathcal{B}}^{\pi_{k}}\hat{Q}^{k}(s, a)\big)^{2}\Big] + \mathcal{R}(\mu),$$
where $\mathcal{R}(\mu)$ is a regularizer on $\mu$. If we choose $\mathcal{R}(\mu) = -D_{\mathrm{KL}}(\mu \,\|\, \rho)$ for some prior distribution $\rho(a \mid s)$, the inner maximization has a closed form, $\mu(a \mid s) \propto \rho(a \mid s)\, \exp\big(Q(s, a)\big)$. With a uniform prior over actions, the first term reduces to a soft-maximum (log-sum-exp) of the Q-values, giving the CQL($\mathcal{H}$) objective

$$\min_{Q}\;\; \alpha\, \mathbb{E}_{s \sim \mathcal{D}}\Big[\log \sum_{a} \exp\big(Q(s, a)\big) - \mathbb{E}_{a \sim \hat{\pi}_{\beta}(a \mid s)}\big[Q(s, a)\big]\Big] + \frac{1}{2}\, \mathbb{E}_{s, a, s' \sim \mathcal{D}}\Big[\big(Q(s, a) - \hat{\mathcal{B}}^{\pi_{k}}\hat{Q}^{k}(s, a)\big)^{2}\Big].$$
If we choose the prior to be the previous policy instead (the CQL($\rho$) variant), we can analytically show that the first term above becomes an exponentially weighted average of Q-values of actions sampled from the previous policy. The latter choice of prior is usually more stable in practice, particularly when the action space is high-dimensional.
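As a quick check on both claims, the inner maximization has a standard closed form when $\mathcal{R}(\mu) = -D_{\mathrm{KL}}(\mu \,\|\, \rho)$ (a derivation sketch, using the notation above and writing the sum over a discrete action set $\mathcal{A}$ for simplicity):

$$\mu^{*}(a \mid s) = \frac{\rho(a \mid s)\, \exp\big(Q(s, a)\big)}{\sum_{a'} \rho(a' \mid s)\, \exp\big(Q(s, a')\big)}, \qquad \max_{\mu}\;\Big(\mathbb{E}_{a \sim \mu}\big[Q(s, a)\big] - D_{\mathrm{KL}}(\mu \,\|\, \rho)\Big) = \log \sum_{a} \rho(a \mid s)\, \exp\big(Q(s, a)\big).$$

With a uniform prior the right-hand side is the log-sum-exp term of CQL($\mathcal{H}$) up to the additive constant $\log|\mathcal{A}|$, while plugging the previous policy in as $\rho$ makes the value-minimization term an exponentially weighted average,

$$\mathbb{E}_{a \sim \mu^{*}}\big[Q(s, a)\big] = \frac{\mathbb{E}_{a \sim \rho}\big[\exp\big(Q(s, a)\big)\, Q(s, a)\big]}{\mathbb{E}_{a \sim \rho}\big[\exp\big(Q(s, a)\big)\big]},$$

which can be estimated with action samples from the previous policy.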
We can now solve the above objective via gradient updates on the parameters $\theta$ of our Q-function $Q_{\theta}$, approximating each expectation with minibatches sampled from the buffer.
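As a rough illustration, the sketch below performs one such gradient update for the CQL($\mathcal{H}$) objective with a discrete action space, assuming a DQN-style setup with a greedy target policy. `q_net`, `q_target`, the optimizer, and the batch format are hypothetical placeholders rather than the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def cql_h_update(q_net, q_target, optimizer, batch, alpha=1.0, gamma=0.99):
    """One gradient step on the CQL(H) objective for a discrete-action Q-network."""
    s, a, r, s2, done = batch["s"], batch["a"], batch["r"], batch["s2"], batch["done"]

    q_all = q_net(s)                                   # [B, |A|], Q_theta(s, .)
    q_sa = q_all.gather(1, a.unsqueeze(1)).squeeze(1)  # Q_theta(s, a) on buffer actions

    # TD target with a greedy target policy: r + gamma * max_a' Q_target(s', a').
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * q_target(s2).max(dim=1).values

    td_loss = 0.5 * F.mse_loss(q_sa, target)

    # CQL(H) penalty: soft-maximum over all actions minus Q on dataset actions.
    cql_penalty = (torch.logsumexp(q_all, dim=1) - q_sa).mean()

    loss = alpha * cql_penalty + td_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In an offline training loop this update would simply be repeated on minibatches drawn from the fixed buffer, with the target network synchronized periodically; no environment interaction is required.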