MoCo (Momentum Contrast) is an unsupervised representation learning technique for images. We can generalize contrastive learning to the task of training some encoder for a dictionary look-up: given an encoded query $q$ and encoded keys $\{k_0, k_1, k_2, \dots\}$ of the dictionary, find the correct match (positive key $k_+$) for $q$. In MoCo, the correct match is the key image that came from the same source image as the query (after augmentation).
To find the match, our goal is to make $q$ and $k_+$ more similar than $q$ and any other key. Measuring similarity with the dot product, we can achieve this by minimizing the InfoNCE loss,

$$\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)},$$

where $\tau$ is a temperature parameter.
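To make the formula concrete, here is a minimal PyTorch sketch (the function name `info_nce_loss` and the tensor shapes are illustrative, not from the paper): it casts the comparison of one positive against $K$ negatives as a softmax cross-entropy with the positive at index 0, which is exactly the loss above.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, k_neg, tau=0.07):
    """InfoNCE loss for a batch of queries.

    q:     (N, C) encoded queries
    k_pos: (N, C) positive keys (one per query)
    k_neg: (K, C) negative keys shared by all queries
    tau:   temperature
    """
    l_pos = torch.einsum("nc,nc->n", q, k_pos).unsqueeze(-1)  # (N, 1) positive logits
    l_neg = torch.einsum("nc,kc->nk", q, k_neg)               # (N, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    # The positive key sits in column 0, so the "correct class" is index 0;
    # cross-entropy then equals -log softmax at the positive, i.e. the InfoNCE loss.
    labels = torch.zeros(q.shape[0], dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```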
A standard training method would be to sample a minibatch and minimize this loss by taking, for each query, one item in the minibatch as its positive and the rest as negatives. However, if we can compare the positive against more negative samples, our training will be more stable.
MoCo introduces a queue that represents our dictionary of keys, effectively decoupling the dictionary from the minibatch. When we sample a new minibatch, its encoded keys are all enqueued and the oldest keys are dequeued. The entire queue is used in the loss computation as negatives, thereby increasing the number of negative samples we compare against.
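Assuming the `info_nce_loss` sketch above, the queue mechanics might look roughly like this (the FIFO update via `torch.cat` and slicing is a simplification of the pointer-based buffer used in practice):

```python
def moco_step(q, k, queue, tau=0.07):
    """One training step: the queue supplies the negatives, then gets refreshed.

    q:     (N, C) queries from the query encoder
    k:     (N, C) keys from the key encoder for the same images (positives)
    queue: (K, C) keys from previous minibatches (negatives)
    """
    loss = info_nce_loss(q, k, queue, tau)
    # Enqueue the newest keys and dequeue the oldest, keeping the size fixed at K.
    new_queue = torch.cat([k.detach(), queue], dim=0)[: queue.shape[0]]
    return loss, new_queue
```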
Another innovation is that the encoder can be different for queries and keys. We'll have a key encoder $f_k$ and a query encoder $f_q$; $f_q$ can be trained directly via backpropagation, but since our queue can be very large, we can't compute the gradient through every negative sample to optimize $f_k$. Other works have set $f_k = f_q$ (copying the query encoder's parameters into the key encoder), but this yields poor results since in our setting the encoder's rapid updates make the encoded keys in the queue too inconsistent. Alternatively, for $f_k$, we can use a momentum update on its parameters $\theta_k$,

$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q,$$
where $m \in [0, 1)$ is a momentum coefficient, effectively making $f_k$ "lag" behind $f_q$ and maintaining stability.
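A minimal sketch of this update in PyTorch (the momentum value of 0.999 follows the paper's default; the function name is illustrative):

```python
import torch

@torch.no_grad()
def momentum_update(f_q, f_k, m=0.999):
    """Move the key encoder slightly toward the query encoder.

    f_q, f_k: query and key encoders with identical architectures
    m:        momentum coefficient; larger m means f_k changes more slowly
    """
    for param_q, param_k in zip(f_q.parameters(), f_k.parameters()):
        # theta_k <- m * theta_k + (1 - m) * theta_q
        param_k.data.mul_(m).add_(param_q.data, alpha=1 - m)
```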