DINO is a self-supervised pre-training framework that extends pre-training ideas from natural language processing (like in 🧸 BERT) to vision. It performs knowledge distillation with no labels, using a student-teacher setup where the teacher is built from past iterations of the student.

Formally, we have a student $g_{\theta_s}$ with parameters $\theta_s$ and a teacher $g_{\theta_t}$ with parameters $\theta_t$. Given an input image $x$, both networks output a probability distribution over $K$ dimensions defined via a temperature softmax,

$$
P_s(x)^{(i)} = \frac{\exp\left(g_{\theta_s}(x)^{(i)} / \tau_s\right)}{\sum_{k=1}^{K} \exp\left(g_{\theta_s}(x)^{(k)} / \tau_s\right)},
$$

with temperature $\tau_s > 0$, and analogously $P_t$ with temperature $\tau_t$.

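As a minimal sketch (PyTorch, with a hypothetical function name of my choosing), this is just a temperature-scaled softmax over the projection head's logits:

```python
import torch
import torch.nn.functional as F

def output_distribution(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    # Temperature-scaled softmax over the K output dimensions;
    # a lower temperature produces a peakier (sharper) distribution.
    return F.softmax(logits / temperature, dim=-1)
```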
In order for the two networks to learn meaningful semantics, we give them different data: for an image $x$, we create two global views $x_1^g, x_2^g$ and several smaller local views, giving all views to the student and only the global views to the teacher. The goal is for both networks to predict the same probabilities, which we enforce by minimizing the cross-entropy

$$
\min_{\theta_s} \sum_{x \in \{x_1^g,\, x_2^g\}} \;\; \sum_{\substack{x' \in V \\ x' \neq x}} H\big(P_t(x),\, P_s(x')\big),
$$

where $V$ is the set of all views and $H(a, b) = -a \log b$.

In other words, we want the student to output the same probabilities as the teacher, even when it only sees a local view of the original image.
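
A rough sketch of this multi-crop objective (function name is mine; teacher centering, described below, is omitted here for clarity):

```python
def dino_loss(student_logits, teacher_logits, tau_s=0.1, tau_t=0.04):
    # student_logits: list of [B, K] tensors, one per view (2 global + several local),
    #   where the first two entries are assumed to be the global views;
    # teacher_logits: list of [B, K] tensors, one per global view.
    teacher_probs = [F.softmax(t / tau_t, dim=-1).detach() for t in teacher_logits]
    total, n_terms = 0.0, 0
    for t_idx, p_t in enumerate(teacher_probs):
        for s_idx, s_logits in enumerate(student_logits):
            if s_idx == t_idx:
                continue  # skip pairs where both networks see the same view
            log_p_s = F.log_softmax(s_logits / tau_s, dim=-1)
            total = total + (-(p_t * log_p_s).sum(dim=-1)).mean()  # H(P_t, P_s)
            n_terms += 1
    return total / n_terms
```

Note the `.detach()`: gradients only flow through the student.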

Both networks use the same architecture (consisting of a general backbone and a projection head), but their weights differ. The student parameters $\theta_s$ are learned via gradient descent on the loss above, and the teacher parameters $\theta_t$ are updated from the student parameters via an exponential moving average,

$$
\theta_t \leftarrow \lambda \theta_t + (1 - \lambda)\, \theta_s.
$$

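A sketch of this update, assuming `student` and `teacher` share the same architecture (the paper schedules $\lambda$ from 0.996 toward 1 over training):

```python
@torch.no_grad()
def ema_update(student: torch.nn.Module, teacher: torch.nn.Module,
               momentum: float = 0.996):
    # theta_t <- lambda * theta_t + (1 - lambda) * theta_s
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)
```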
Note that this formulation is similar to 🐼 MoCo's momentum encoder, but this setup uses neither a queue nor a contrastive objective. Also, to avoid collapse, we apply centering and sharpening to the teacher's outputs; that is, we maintain a center $c$, also updated via an EMA over batch statistics,

$$
c \leftarrow m c + (1 - m) \frac{1}{B} \sum_{i=1}^{B} g_{\theta_t}(x_i),
$$

which is subtracted from the teacher's logits before the softmax, and we set a low teacher temperature $\tau_t$ for sharpening. The former avoids collapse onto a dominant dimension by encouraging a uniform output, and the latter avoids collapse to uniformity, essentially the reverse.
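
A sketch of both tricks on the teacher side ($m = 0.9$ follows the paper; subtracting the center before the softmax matches the reference implementation, and the function names are mine):

```python
@torch.no_grad()
def update_center(center: torch.Tensor, teacher_logits: torch.Tensor,
                  m: float = 0.9) -> torch.Tensor:
    # c <- m * c + (1 - m) * mean of teacher outputs over the batch
    return m * center + (1 - m) * teacher_logits.mean(dim=0)

def teacher_distribution(teacher_logits: torch.Tensor, center: torch.Tensor,
                         tau_t: float = 0.04) -> torch.Tensor:
    # Centering (subtract c) pushes toward uniform; sharpening (low tau_t)
    # pushes away from uniform. Together they prevent both collapse modes.
    return F.softmax((teacher_logits - center) / tau_t, dim=-1)
```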

Using a 🦿 Vision Transformer as the backbone for $g$, DINO pre-training learns semantically meaningful segmentations in the self-attention maps. Moreover, a simple 🏠 K-Nearest Neighbors classifier on the frozen features yields strong performance on ImageNet classification.
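
For intuition, a simplified k-NN evaluation might look like the sketch below (the paper reports $k = 20$ working well and weights neighbor votes; here votes are weighted by raw cosine similarity, a simplification):

```python
def knn_classify(train_feats: torch.Tensor, train_labels: torch.Tensor,
                 test_feats: torch.Tensor, k: int = 20) -> torch.Tensor:
    # Cosine similarity via L2-normalized frozen backbone features;
    # train_labels is assumed to be a LongTensor of class indices.
    train_feats = F.normalize(train_feats, dim=-1)
    test_feats = F.normalize(test_feats, dim=-1)
    sims = test_feats @ train_feats.T                # [N_test, N_train]
    topk_sims, topk_idx = sims.topk(k, dim=-1)       # k nearest neighbors
    topk_labels = train_labels[topk_idx]             # [N_test, k]
    num_classes = int(train_labels.max()) + 1
    votes = torch.zeros(test_feats.size(0), num_classes)
    votes.scatter_add_(1, topk_labels, topk_sims)    # similarity-weighted votes
    return votes.argmax(dim=-1)
```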