Value implicit pre-training (VIP) is a self-supervised representation learning method that implicitly learns a goal-conditioned value function for reinforcement learning. The core of the algorithm is to learn an embedding $\phi$ of raw observations such that distances in embedding space encode progress toward a goal.
To learn this embedding, we can use egocentric human video data; though it is out of distribution for robotic observations, the embedding we learn from it can be shared across both domains. Specifically, suppose we want to find an optimal human policy that maximizes the discounted sum of some goal-conditioned reward over this video data.
In our scenario, we set the reward to be a sparse goal-reaching signal, $\tilde r(o; g) = \mathbb{1}\{o = g\} - 1$: every step earns $-1$ until the goal observation is reached, and $0$ once it has been.
Our value function is parameterized directly by the embedding as the negative distance to the goal, $V_\phi(o; g) = -\|\phi(o) - \phi(g)\|_2$, so an observation that matches the goal attains the maximal value of $0$.
However, the human videos come without action labels, and even labeled human actions would not transfer to the robot's action space, so we cannot optimize this goal-conditioned RL problem directly. VIP instead takes the dual of the (KL-regularized) objective, which is an optimization over value functions alone and therefore needs only observations.
Our final training objective for $\phi$ plugs the embedding-distance value function into this dual:

$$\mathcal{L}_{\text{VIP}}(\phi) = \mathbb{E}_{p(g)}\Big[(1-\gamma)\,\mathbb{E}_{\mu_0(o;g)}\big[V_\phi(o; g)\big] + \log \mathbb{E}_{(o, o') \sim D(\cdot; g)}\big[\exp\big(\tilde r(o; g) + \gamma V_\phi(o'; g) - V_\phi(o; g)\big)\big]\Big],$$

where $\gamma$ is the discount factor, $\mu_0$ is the distribution over initial frames, and $D$ samples consecutive frames $(o, o')$ from the videos. Intuitively, the first term pushes initial-frame embeddings away from their goal embeddings, while the second term enforces a soft Bellman consistency that pulls consecutive frames together, so the embedding distance to the goal decreases smoothly along a trajectory.
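To make the objective concrete, here is a minimal PyTorch-style sketch of the loss for a batch of clips. The encoder interface, batch layout, sampling of $(o, o')$ pairs, and hyperparameters are illustrative assumptions, not the exact recipe from the paper.

```python
import math
import torch

def vip_loss(encoder, frames, gamma=0.98):
    """Minimal sketch of the VIP objective for one batch of video clips.

    frames: (B, T, C, H, W); frame 0 is treated as the initial observation and
    frame T-1 as the goal. The sampling scheme and gamma are illustrative.
    """
    B, T = frames.shape[:2]
    idx = torch.arange(B)
    t = torch.randint(0, T - 1, (B,))          # random intermediate transition (o_t, o_{t+1})

    phi_0 = encoder(frames[:, 0])              # phi(o_0)
    phi_g = encoder(frames[:, -1])             # phi(g)
    phi_o = encoder(frames[idx, t])            # phi(o_t)
    phi_o2 = encoder(frames[idx, t + 1])       # phi(o_{t+1})

    def V(a, b):                               # V(o; g) = -||phi(o) - phi(g)||_2
        return -torch.norm(a - b, dim=-1)

    # Sparse reward: -1 for every non-goal frame (o_t here is never the goal, since t < T-1).
    r = torch.full((B,), -1.0, device=phi_o.device)

    # Term 1: (1 - gamma) * E[V(o_0; g)] -- pushes initial frames away from their goals.
    term1 = (1 - gamma) * V(phi_0, phi_g).mean()

    # Term 2: log E[exp(r + gamma * V(o'; g) - V(o; g))] -- soft Bellman consistency
    # that pulls consecutive frames together.
    adv = r + gamma * V(phi_o2, phi_g) - V(phi_o, phi_g)
    term2 = torch.logsumexp(adv, dim=0) - math.log(B)

    return term1 + term2
```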
With a learned embedding, we can then compute a reward function and train a policy via offline trajectory optimization; in the VIP paper, MPPI was used. We can also perform few-shot learning with reward-weighted regression, a weighted form of behavior cloning.
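As a rough illustration of this downstream use, the sketch below derives a dense reward from a frozen embedding (taking the reward to be the one-step decrease in embedding distance to the goal) and computes reward-weighted regression weights for a set of demonstrations. The function names and the temperature `beta` are my own, and the exact reward shaping used in the paper may differ.

```python
import torch

@torch.no_grad()
def embedding_reward(encoder, obs_seq, goal_img):
    """Dense reward from a frozen embedding: one-step decrease in distance to the goal.

    obs_seq: (T, C, H, W) frames from a single trajectory; goal_img: (C, H, W).
    """
    phi = encoder(obs_seq)                      # (T, K)
    phi_g = encoder(goal_img.unsqueeze(0))      # (1, K)
    dist = torch.norm(phi - phi_g, dim=-1)      # (T,)
    return dist[:-1] - dist[1:]                 # (T-1,): positive when moving toward the goal

def rwr_weights(trajectory_rewards, beta=1.0):
    """Reward-weighted regression: per-trajectory weights from exponentiated returns.

    trajectory_rewards: list of (T_i - 1,) reward tensors, one per demonstration.
    The resulting weights scale each demonstration's behavior-cloning loss.
    """
    returns = torch.stack([r.sum() for r in trajectory_rewards])
    return torch.softmax(beta * returns, dim=0)
```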
LIV
Language-image value (LIV) learning extends VIP to textual goals by aligning the text and image modalities in the same embedding space. At first glance, we could simply optimize the summed VIP loss on both the image and text encoders; the text loss treats the video's language annotation as the goal, comparing its embedding against the embeddings of the video frames.
However, while this does not explicitly ensure that the text and image embeddings are aligned, it can be shown that for a degenerate video distribution consisting of only the goal image, the VIP text loss reduces to InfoNCE,
where

$$\mathcal{L}_{\text{InfoNCE}} = -\,\mathbb{E}\left[\log \frac{\exp\big(S(\phi(o^g), \psi(l))\big)}{\sum_{l'} \exp\big(S(\phi(o^g), \psi(l'))\big)}\right],$$

with $o^g$ the goal frame, $l$ its language annotation, $l'$ ranging over the other annotations in the batch, $\psi$ the text encoder, and $S$ the similarity used to define the value. This is exactly the CLIP-style contrastive objective that aligns paired image and text embeddings.
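A minimal sketch of that contrastive term, assuming CLIP-style cosine similarity over a batch of paired (goal frame, caption) embeddings and an illustrative temperature:

```python
import torch
import torch.nn.functional as F

def infonce_loss(goal_img_emb, text_emb, temperature=0.07):
    """CLIP-style InfoNCE over a batch of (goal frame, caption) embedding pairs.

    goal_img_emb, text_emb: (B, K). Matching rows are positives; every other
    caption in the batch acts as a negative for a given goal frame.
    """
    img = F.normalize(goal_img_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    logits = img @ txt.t() / temperature                     # (B, B) similarities
    labels = torch.arange(img.shape[0], device=img.device)   # positives on the diagonal
    # Symmetric cross-entropy over both matching directions.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```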
Thus, during training, we optimize the original VIP objective along with the InfoNCE loss. In practice, we can pre-train the embedding on narrated videos (EPIC-KITCHENS, for example) and train a policy on top of it with behavior cloning. The LIV objective can also be used to fine-tune pre-trained vision-language models on in-domain task data.
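Putting the pieces together, one LIV-style update might look like the sketch below, reusing the `vip_loss` and `infonce_loss` sketches above; the weighting `lam`, the encoder interfaces, and the training-loop details are assumptions rather than the paper's exact procedure.

```python
import torch

def liv_step(img_encoder, txt_encoder, frames, captions, optimizer, lam=1.0):
    """One combined update: VIP loss on the video frames plus InfoNCE alignment
    between each clip's goal frame and its language annotation.

    frames: (B, T, C, H, W); captions: whatever the text encoder expects
    (e.g., pre-tokenized annotations). `lam` trades off the two terms.
    """
    goal_emb = img_encoder(frames[:, -1])       # embeddings of goal frames
    text_emb = txt_encoder(captions)            # embeddings of the annotations
    loss = vip_loss(img_encoder, frames) + lam * infonce_loss(goal_emb, text_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```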