Value implicit pre-training (VIP) is a self-supervised representation learning method for implicitly generating goal-based value functions for reinforcement learning. The core of the algorithm is to learn an embedding of images that we can later use to derive the value function.

To learn this embedding, we can use ego-centric human data; though it's out of distribution for robotic observations, the embedding we learn from this data can be shared across both domains. Specifically, if we wanted to find some optimal human policy $\pi_H^*$ that maximizes some reward $r(o; g)$, the KL-regularized reinforcement learning objective has (assuming a deterministic policy) the dual problem of finding some optimal embedding $\phi^*$ and value function $V^*$:

$$\max_{\phi, V}\; \mathbb{E}_{p(g)}\left[(1-\gamma)\,\mathbb{E}_{\mu_0(o;\, g)}\big[V(\phi(o); \phi(g))\big] - \log \mathbb{E}_{(o_t, o_{t+1};\, g) \sim D}\Big[\exp\big(V(\phi(o_t); \phi(g)) - r(o_t; g) - \gamma\, V(\phi(o_{t+1}); \phi(g))\big)\Big]\right],$$

where $p(g)$ is the goal distribution, $\mu_0$ the initial-frame distribution, and $D$ the offline dataset of video transitions.
In our scenario, we set the reward:

$$r(o; g) = \mathbb{1}\{o = g\} - 1,$$

which is $0$ once the goal is reached and $-1$ otherwise.
Our value function $V^*(o; g)$ captures the discounted number of steps needed to reach $g$ from $o$, so our embedding learns the features needed to predict temporal distance. In fact, with some manipulation, we can show that with the optimal $V^*$, $\phi$ is trained via an objective similar to ℹ️ InfoNCE, which we can interpret as a contrastive method of attracting the initial and goal states and repelling intermediate states, thereby giving us a smooth curve in the embedding space and a steadily decreasing distance to the goal embedding as we transition toward the goal.

However, $V^*$ isn't known, but since it plays the role of a (negative) distance measure in the context of InfoNCE, we can use the $L_2$ distance,

$$V(o; g) = -\big\lVert \phi(o) - \phi(g) \big\rVert_2.$$
Our final training objective for $\phi$ is thus to minimize

$$\mathcal{L}_{\mathrm{VIP}}(\phi) = \mathbb{E}_{p(g)}\left[(1-\gamma)\,\mathbb{E}_{\mu_0(o;\, g)}\big[\lVert \phi(o) - \phi(g) \rVert_2\big] + \log \mathbb{E}_{(o_t, o_{t+1};\, g) \sim D}\Big[\exp\big(-\lVert \phi(o_t) - \phi(g) \rVert_2 - r(o_t; g) + \gamma\,\lVert \phi(o_{t+1}) - \phi(g) \rVert_2\big)\Big]\right].$$
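Here is a minimal sketch of this objective in PyTorch, operating on pre-computed embeddings; the function and variable names are illustrative and this is not the official VIP implementation:

```python
import torch

def vip_loss(e0, et, et1, eg, reward, gamma=0.98):
    """Sketch of the VIP objective for one batch of embeddings.

    e0     : embeddings phi(o_0) of initial frames sampled from mu_0(o; g)  [B, D]
    et     : embeddings phi(o_t) of intermediate frames                     [B, D]
    et1    : embeddings phi(o_{t+1}) of successor frames                    [B, D]
    eg     : embeddings phi(g) of goal frames                               [B, D]
    reward : r(o_t; g) = 1{o_t == g} - 1                                    [B]
    """
    # V(o; g) = -||phi(o) - phi(g)||_2
    v0 = -torch.norm(e0 - eg, dim=-1)
    vt = -torch.norm(et - eg, dim=-1)
    vt1 = -torch.norm(et1 - eg, dim=-1)

    # First term: attract the initial and goal embeddings.
    attract = (1 - gamma) * (-v0).mean()

    # Second term: log E[exp(V(o_t) - r - gamma * V(o_{t+1}))],
    # which repels intermediate frames from the goal.
    batch = torch.tensor(float(vt.shape[0]))
    repel = torch.logsumexp(vt - reward - gamma * vt1, dim=0) - torch.log(batch)
    return attract + repel
```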
With a learned embedding, we can then compute a reward function (for example, the decrease in embedding distance to the goal between consecutive frames) and train a policy via offline trajectory optimization; in the VIP paper, MPPI was used. Moreover, we can also perform few-shot learning with reward-weighted regression, a weighted form of behavior cloning.
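For downstream control, the embedding-space reward and the reward-weighted regression weights could be computed along the following lines (a sketch; the temperature `beta` and the helper names are assumptions rather than the paper's exact formulation):

```python
import torch

def embedding_reward(et, et1, eg):
    """Reward as the decrease in embedding distance to the goal between
    consecutive frames: ||phi(o_t) - phi(g)|| - ||phi(o_{t+1}) - phi(g)||."""
    return torch.norm(et - eg, dim=-1) - torch.norm(et1 - eg, dim=-1)

def rwr_weights(returns, beta=1.0):
    """Reward-weighted regression: behavior cloning with each transition
    weighted by its exponentiated (discounted) return."""
    return torch.exp(returns / beta)

# Weighted behavior-cloning loss on a demonstration batch (illustrative):
# loss = (rwr_weights(returns) * ((policy(obs) - actions) ** 2).sum(-1)).mean()
```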

LIV

Language-image value (LIV) learning extends VIP to textual goals by aligning the text and image modalities in the same embedding space. At first glance, we can try optimizing the summed VIP loss on both the image and text encoders; the text loss compares $\phi(o)$ with the goal text embedding $\psi(l)$ for language encoder $\psi$ and textual goal $l$, but otherwise, the language loss is the same as above.
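Concretely, the summed loss can reuse the same VIP computation with the image goal embedding swapped for the text embedding of the task description (a sketch building on the `vip_loss` function above; `phi_img`, `phi_text`, and the argument names are assumed placeholders):

```python
def liv_summed_loss(phi_img, phi_text, obs_0, obs_t, obs_next,
                    goal_img, task_text, reward, gamma=0.98):
    """Summed VIP loss over both goal modalities: the language branch reuses
    the image-goal objective, with phi(g) replaced by the text embedding psi(l)."""
    e0, et, et1 = phi_img(obs_0), phi_img(obs_t), phi_img(obs_next)
    image_goal_loss = vip_loss(e0, et, et1, phi_img(goal_img), reward, gamma)
    text_goal_loss = vip_loss(e0, et, et1, phi_text(task_text), reward, gamma)
    return image_goal_loss + text_goal_loss
```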

However, while this does not explicitly ensure that the text and image embeddings are aligned, it can be shown that for a degenerate video distribution consisting of only the goal image, the VIP text loss becomes InfoNCE,

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\,\mathbb{E}_{(g,\, l)}\left[\log \frac{\exp\big(\mathcal{S}(\phi(g), \psi(l))\big)}{\mathbb{E}_{g'}\Big[\exp\big(\mathcal{S}(\phi(g'), \psi(l))\big)\Big]}\right],$$
where $\mathcal{S}$, our similarity measure, is chosen to be cosine similarity to enable comparisons with 🌐 CLIP.
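For a concrete reference point, a CLIP-style batch InfoNCE term with cosine similarity might look like this (a sketch with an assumed temperature and batch-negative scheme, not the LIV reference code):

```python
import torch
import torch.nn.functional as F

def clip_style_infonce(goal_embs, text_embs, temperature=0.1):
    """InfoNCE over a batch of (goal image, task text) pairs using cosine
    similarity: each text is attracted to its paired goal image and repelled
    from the other goal images in the batch."""
    img = F.normalize(goal_embs, dim=-1)   # [B, D]
    txt = F.normalize(text_embs, dim=-1)   # [B, D]
    logits = txt @ img.t() / temperature   # [B, B] cosine similarities
    labels = torch.arange(txt.shape[0], device=txt.device)
    return F.cross_entropy(logits, labels)
```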

Thus, during training, we optimize the original VIP objective along with the InfoNCE loss. In practice, we can pre-train our embedding on narrated videos (EPIC-KITCHENS, for example) and train a policy with this embedding using 🍵 Behavioral Cloning. The LIV objective can also be used to fine-tune pre-trained vision-language models using in-domain task data.