Conservative Q-learning (CQL) is a framework that modifies the standard Q-function objective to perform better in 🗼 Offline Reinforcement Learning settings, which prevent active exploration. We can mitigate the action distribution shift problem by learning conservative action-value estimates that lower bound the true values, preventing our policy from exploiting inaccuracies in low-data regions of the buffer $\mathcal{D}$'s data distribution collected by the behavior policy $\pi_\beta$.

More concretely, the key idea behind CQL is to minimize Q-values under some distribution $\mu(a|s)$ and maximize those under the data distribution; the former provides conservative estimates, and the latter tightens the bound on the data we do have.

First, since weโ€™re only interested in overestimation for unseen actions, we can restrict to be . โ€œPushing downโ€ on these estimates amounts to adding a regularization term to the Q-function objective,

where the target is the empirical Bellman backup

$$\hat{\mathcal{B}}^{\pi}\hat{Q}^{k}(s, a) = r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi(a'|s')}\big[\hat{Q}^{k}(s', a')\big]$$
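
As a minimal sketch of this "pushing down" objective (not from the original note), assuming a discrete action space, a PyTorch Q-network, and a precomputed Bellman target; names like `q_net`, `mu_probs`, and `target_q` are illustrative placeholders:

```python
import torch
import torch.nn.functional as F

def push_down_loss(q_net, batch, mu_probs, target_q, alpha=1.0):
    """alpha * E_{s~D, a~mu}[Q(s,a)] + 0.5 * Bellman error on dataset transitions.

    q_net     : network mapping states -> Q-values for each discrete action
    batch     : dict of tensors sampled from the offline dataset D
    mu_probs  : [B, n_actions] action probabilities of mu(a|s)
    target_q  : [B] precomputed Bellman backup  B^pi Q^k(s,a)
    """
    q_all = q_net(batch["obs"])                                   # [B, n_actions]
    q_data = q_all.gather(1, batch["actions"].long().unsqueeze(1)).squeeze(1)

    # "Push down" Q-values under mu(a|s): E_{a~mu}[Q(s,a)]
    push_down = (mu_probs * q_all).sum(dim=1).mean()

    # Standard Bellman error on (s, a) pairs from the dataset
    bellman = 0.5 * F.mse_loss(q_data, target_q)

    return alpha * push_down + bellman
```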
This update step will lower bound our action-value estimate $\hat{Q}^{\pi}(s, a)$ for all $(s, a)$, but in most cases, we only care about its expectation over actions (which gives us the state value $V^{\pi}(s)$). We can tighten our lower bound in expectation by setting $\mu = \pi$ and introducing the "pushing up" term, giving us

$$\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\; \alpha\, \Big(\mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(a|s)}\big[Q(s, a)\big] - \mathbb{E}_{s \sim \mathcal{D},\, a \sim \hat{\pi}_\beta(a|s)}\big[Q(s, a)\big]\Big) + \frac{1}{2}\, \mathbb{E}_{s, a \sim \mathcal{D}}\Big[\big(Q(s, a) - \hat{\mathcal{B}}^{\pi}\hat{Q}^{k}(s, a)\big)^{2}\Big]$$
where $\hat{\pi}_\beta$ is the behavior (data-generating) policy, estimated from action frequencies in the dataset $\mathcal{D}$.
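
Since the logged actions in $\mathcal{D}$ are samples from $\hat{\pi}_\beta$, the "pushing up" expectation can be estimated directly with the dataset actions. A small sketch of the resulting conservative gap term, reusing the hypothetical tensor naming from above:

```python
import torch

def conservative_gap(q_all, dataset_actions, mu_probs, alpha=1.0):
    """alpha * ( E_{a~mu}[Q(s,a)] - E_{a~pi_beta}[Q(s,a)] ).

    q_all           : [B, n_actions] Q-values for each state in the batch
    dataset_actions : [B] actions logged in D (samples from the behavior policy)
    mu_probs        : [B, n_actions] probabilities of mu(a|s)
    """
    push_down = (mu_probs * q_all).sum(dim=1)                     # E_{a~mu}[Q(s,a)]
    push_up = q_all.gather(1, dataset_actions.long().unsqueeze(1)).squeeze(1)
    return alpha * (push_down - push_up).mean()
```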

In practice, running policy iteration with this modified evaluation objective is computationally expensive. We can instead merge the two steps (policy evaluation and policy improvement) together by choosing $\mu$ to maximize the current Q-values, giving us the CQL($\mathcal{R}$) algorithm:

$$\min_{Q}\, \max_{\mu}\; \alpha\, \Big(\mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(a|s)}\big[Q(s, a)\big] - \mathbb{E}_{s \sim \mathcal{D},\, a \sim \hat{\pi}_\beta(a|s)}\big[Q(s, a)\big]\Big) + \frac{1}{2}\, \mathbb{E}_{s, a \sim \mathcal{D}}\Big[\big(Q(s, a) - \hat{\mathcal{B}}^{\pi_k}\hat{Q}^{k}(s, a)\big)^{2}\Big] + \mathcal{R}(\mu)$$
where $\mathcal{R}(\mu)$ is some regularization on $\mu$, measured as the ✂️ KL Divergence with a prior $\rho(a|s)$. This prior can be chosen as a uniform distribution over actions; analytically solving the above min-max for this prior gives us CQL($\mathcal{H}$),

$$\min_{Q}\; \alpha\, \mathbb{E}_{s \sim \mathcal{D}}\Big[\log \sum_{a} \exp\big(Q(s, a)\big) - \mathbb{E}_{a \sim \hat{\pi}_\beta(a|s)}\big[Q(s, a)\big]\Big] + \frac{1}{2}\, \mathbb{E}_{s, a \sim \mathcal{D}}\Big[\big(Q(s, a) - \hat{\mathcal{B}}^{\pi_k}\hat{Q}^{k}(s, a)\big)^{2}\Big]$$
If we instead choose the prior to be the previous policy $\hat{\pi}^{k-1}$, we can show analytically that the first term above becomes an exponentially weighted average of Q-values of actions sampled from the previous policy. This latter choice of prior is usually more stable in practice.
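
Under the uniform prior, the inner maximization over $\mu$ has a closed form and the "push down" term becomes a log-sum-exp over actions. Below is a minimal sketch of the CQL($\mathcal{H}$) loss for a discrete-action Q-network; the function and tensor names are assumptions, and `target_q` is taken as a precomputed Bellman backup:

```python
import torch
import torch.nn.functional as F

def cql_h_loss(q_net, obs, actions, target_q, alpha=1.0):
    """CQL(H) objective for discrete actions:
    alpha * E_s[ logsumexp_a Q(s,a) - Q(s, a_data) ] + 0.5 * Bellman error.
    """
    q_all = q_net(obs)                                            # [B, n_actions]
    q_data = q_all.gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # Closed-form "push down" term under the uniform prior: logsumexp over actions,
    # minus the "push up" term at the logged (behavior-policy) actions
    conservative = (torch.logsumexp(q_all, dim=1) - q_data).mean()

    # Standard Bellman error against the precomputed backup target
    bellman = 0.5 * F.mse_loss(q_data, target_q)

    return alpha * conservative + bellman
```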

We can now solve the above equation via a gradient update,

$$\theta \leftarrow \theta - \eta\, \nabla_{\theta}\, \mathcal{L}_{\text{CQL}(\mathcal{H})}(\theta)$$

for our parameterized Q-function $Q_\theta$. The effect of this optimization is that our Q-function serves as a conservative estimate of the state value and also "expands" the gap between in-distribution and out-of-distribution actions. This modified Q-update can be used directly with 🚀 Q-Learning or 🎭 Actor-Critic to achieve conservative versions of the two.
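
As one possible instantiation (an assumption, not prescribed by CQL), here's a sketch of a single conservative Q-learning gradient step that plugs the `cql_h_loss` sketch above into a DQN-style backup; the network, optimizer, and batch layout are all hypothetical:

```python
import torch

def cql_dqn_update(q_net, target_net, optimizer, batch, alpha=1.0, gamma=0.99):
    """One conservative Q-learning gradient step with a DQN-style backup.

    Assumes discrete actions; `batch` holds obs, actions, rewards, next_obs, dones
    sampled from the fixed offline dataset D.
    """
    with torch.no_grad():
        # Q-learning backup: r + gamma * max_a' Q_target(s', a')
        next_q = target_net(batch["next_obs"]).max(dim=1).values
        target_q = batch["rewards"] + gamma * (1.0 - batch["dones"].float()) * next_q

    loss = cql_h_loss(q_net, batch["obs"], batch["actions"], target_q, alpha)

    # theta <- theta - eta * grad_theta L_CQL(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```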