MoCo (Momentum Contrast) is an unsupervised representation learning technique for images. We can generalize contrastive learning to the task of training an encoder for a dictionary look-up: given an encoded query $q$ and encoded keys $\{k_0, k_1, k_2, \dots\}$ of the dictionary, find the correct match (positive key) $k_+$ for $q$. In MoCo, the correct match is the key image that came from the same source image as the query (after augmentation).

To find the match, our goal is to make $q$ and $k_+$ more similar than $q$ and any other key $k_i$. Measuring similarity with the dot product, we can achieve this by minimizing the ℹ️ InfoNCE loss,
$$\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)},$$

where $\tau$ is a temperature parameter and the sum runs over one positive and $K$ negative keys.
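
For concreteness, here is a minimal PyTorch sketch of this loss for a single query, assuming `q`, `k_pos`, and `k_negs` are already-encoded (and L2-normalized) feature vectors; the names are illustrative, not MoCo's actual code:

```python
import torch
import torch.nn.functional as F

def info_nce(q, k_pos, k_negs, tau=0.07):
    """InfoNCE loss for a single query.

    q:      (d,)   encoded query
    k_pos:  (d,)   encoded positive key
    k_negs: (K, d) encoded negative keys
    tau:    temperature
    """
    l_pos = (q * k_pos).sum().unsqueeze(0)    # (1,)   dot product with the positive
    l_neg = k_negs @ q                        # (K,)   dot products with the negatives
    logits = torch.cat([l_pos, l_neg]) / tau  # (1+K,)
    # The positive sits at index 0, so the loss is cross-entropy with label 0.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```

Minimizing this is equivalent to a $(1 + K)$-way softmax classification problem whose correct class corresponds to the positive key.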

A standard training method would be to sample a minibatch and minimize this loss, treating each item in the minibatch as the positive in turn and the remaining items as its negatives. However, if we can compare the positive against more negative samples, our training becomes more stable.
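
As a rough sketch of that standard setup, assuming `q` and `k` are batches of already-encoded queries and keys where row `i` of each comes from the same source image:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive(q, k, tau=0.07):
    """q, k: (N, d) encoded queries and keys; q[i]'s positive is k[i].

    Every other row of k serves as a negative for q[i], so the number of
    negatives is capped at N - 1 by the minibatch size.
    """
    logits = q @ k.t() / tau          # (N, N) pairwise similarities
    labels = torch.arange(q.size(0))  # the correct key for q[i] is k[i]
    return F.cross_entropy(logits, labels)
```

Here the number of negatives is limited by the batch size, which is the limitation MoCo addresses next.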

MoCo introduces a queue that holds our dictionary of keys, effectively decoupling the dictionary from the minibatch. When we sample a new minibatch, its encoded keys are enqueued and the oldest keys in the queue are dequeued. The entire queue is used in the loss computation, thereby increasing the number of negative samples we compare against.
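
A sketch of this queue bookkeeping as a fixed-size circular buffer; the queue size, feature dimension, and function name below are illustrative choices:

```python
import torch
import torch.nn.functional as F

K, d = 65536, 128                              # illustrative queue size and feature dim
queue = F.normalize(torch.randn(K, d), dim=1)  # placeholder keys until the queue fills
ptr = 0                                        # position of the oldest keys

def dequeue_and_enqueue(new_keys):
    """Enqueue the current minibatch's encoded keys, dropping the oldest ones."""
    global ptr
    n = new_keys.size(0)
    queue[ptr:ptr + n] = new_keys              # assumes K is a multiple of the batch size
    ptr = (ptr + n) % K
```

At each step, the whole `queue` then serves as the set of negative keys in the InfoNCE loss above.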

Another innovation is that the encoder can be different for queries and keys. We'll have a key encoder $f_k$ and a query encoder $f_q$; $f_q$ can be trained directly via backpropagation, but since our queue can be very large, we can't compute the gradient through every negative sample to optimize $f_k$. Other works have set $f_k = f_q$, but this yields poor results since, in our setting, the encoder's updates make the encoded keys in the queue too inconsistent. Alternatively, for $f_k$, we can use a momentum update,
$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q,$$

where $\theta_k$ and $\theta_q$ are the parameters of $f_k$ and $f_q$ and $m \in [0, 1)$ is a momentum coefficient close to 1 (MoCo uses 0.999), effectively making $f_k$ "lag" behind $f_q$ and maintaining stability.
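
A sketch of such a momentum update in PyTorch, assuming `f_q` and `f_k` are modules with identical architecture (the tiny linear encoder here is only a stand-in):

```python
import copy
import torch
import torch.nn as nn

f_q = nn.Sequential(nn.Linear(32, 128))  # stand-in query encoder
f_k = copy.deepcopy(f_q)                 # key encoder starts as an exact copy
for p in f_k.parameters():
    p.requires_grad = False              # f_k is never updated by backprop

@torch.no_grad()
def momentum_update(m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q
    for p_k, p_q in zip(f_k.parameters(), f_q.parameters()):
        p_k.mul_(m).add_(p_q, alpha=1 - m)
```

After the optimizer updates `f_q` at each training step, `momentum_update()` would be called so that `f_k` drifts slowly toward `f_q` without ever receiving gradients.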