https://arxiv.org/abs/2104.14294
TL;DR
- DINO introduces a self-supervised method that doesn’t require labels or negative samples.
- It uses a teacher network and a student network, updating the teacher network’s parameters through EMA.
- The teacher network sees only global views, while the student network sees all crops (global and local). The student is trained to output features matching the teacher's.
- To prevent collapse, where all images map to nearly identical features, centering and sharpening are applied to the teacher output.
Introduction
ViT requires significant computational resources and data. What benefits does it offer in return?
Motivation: Self-supervised learning has been successful in NLP. Is ViT’s lack of a clear advantage due to its reliance on supervised learning?
- GPT has clearer supervision signals than image models, because it is trained to predict the next token.
- Supervised image training often reduces the rich visual information contained in an image to a single concept selected from a predefined set of a few thousand object categories.
Key Components
- Momentum encoder: Provides more stable and gradual updates.
- Can be interpreted as a form of knowledge distillation without labels.
- The method avoids collapse using only centering and sharpening of the teacher output.
Existing Self-Supervised Learning Methods for Images
- Instance Discrimination: Treats each image as a separate class (instance) and generates positive sample pairs through data augmentation (e.g., cropping, rotation). Images in positive set should have similar representations.
- A problem with this approach is that it does not scale well to large datasets because the number of categories grows linearly with the number of images.
- Contrastive methods typically rely on a noise contrastive estimator (NCE), which requires comparing each sample against many negatives.
- BYOL: Similar in concept to this paper; it also trains a student to match a momentum teacher without negative samples.
Approach
- Two networks: a teacher and a student. They have the same architecture but different parameters.
- Given an input image x, both networks output probability distributions over K dimensions, normalized with a temperature-scaled softmax: P(x)_i = exp(g(x)_i / τ) / Σ_k exp(g(x)_k / τ). The teacher uses a lower temperature than the student.
- We use cross-entropy loss to make the student output similar to the teacher output.
sg means stop-gradient: the teacher branch is treated as a constant target, so only the student receives gradients.
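The objective above can be sketched in plain Python. This is a minimal illustration, not the real implementation: in DINO the logits come from network heads, whereas the logits and temperature values below are made up for demonstration.

```python
import math

def softmax(logits, tau):
    """Temperature-scaled softmax: lower tau gives a sharper distribution."""
    exps = [math.exp(l / tau) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p_teacher, p_student):
    """H(P_t, P_s) = -sum_k P_t[k] * log(P_s[k]).
    The teacher distribution is a constant target (the stop-gradient)."""
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

# Toy K=3 head outputs for one image (illustrative numbers).
teacher_logits = [2.0, 1.0, 0.1]
student_logits = [1.8, 1.2, 0.3]

p_t = softmax(teacher_logits, tau=0.04)  # sharp teacher temperature
p_s = softmax(student_logits, tau=0.1)   # softer student temperature
loss = cross_entropy(p_t, p_s)
```

Minimizing this loss pushes the student's distribution toward the teacher's sharper one.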
Input Processing
- Input x is processed into two global views and several smaller local views (multi-crop).
- Global views have a resolution of 224x224, each covering a large area (more than 50%) of the original image.
- Local views have a resolution of 96x96, each covering only a small area (less than 50%) of the original image.
- All crops are passed through the student, while only the global view is passed through the teacher.
Updating the Teacher Network
- Directly copying parameters from the student network in real-time does not lead to convergence.
- Copying the student's parameters only once per epoch (keeping the teacher frozen within an epoch) works surprisingly well.
- We use an exponential moving average of the student network parameters: θteacher←λθteacher+(1−λ)θstudent, where λ follows a cosine schedule from 0.996 to 1 during training.
- The teacher model consistently provides more stable features.
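The EMA update with its cosine schedule can be sketched directly from the formula above. The schedule shape below is a standard cosine ramp and an assumption about the exact form; the endpoints 0.996 and 1.0 are from the notes.

```python
import math

def cosine_lambda(step, total_steps, base=0.996, final=1.0):
    """Cosine schedule for the EMA coefficient lambda, from base to final."""
    return final - (final - base) * (math.cos(math.pi * step / total_steps) + 1) / 2

def ema_update(teacher, student, lam):
    """theta_teacher <- lam * theta_teacher + (1 - lam) * theta_student."""
    return [lam * t + (1 - lam) * s for t, s in zip(teacher, student)]

# Toy 2-parameter "networks" (illustrative values).
teacher = [0.0, 1.0]
student = [1.0, 0.0]
total_steps = 100
for step in range(total_steps):
    lam = cosine_lambda(step, total_steps)
    teacher = ema_update(teacher, student, lam)
```

Because lambda stays close to 1, the teacher drifts toward the student slowly, which is what gives it its stability.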
Avoiding Collapse
- Collapse means that the features of all samples become almost identical.
- Centering: Subtracting a running mean c from the teacher output before the softmax.
- The center is updated with an EMA over batch statistics: c ← m·c + (1−m)·mean(teacher outputs in the batch).
- Sharpening: Using a low temperature in the teacher softmax so that a few dimensions dominate the output distribution.
- Centering encourages a uniform distribution, while sharpening makes the output more discriminative. They balance each other out.
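The two operations can be combined in a small sketch. The momentum value m and the batch of logits are illustrative assumptions; only the order of operations (center, then low-temperature softmax) follows the description above.

```python
import math

def softmax(logits, tau):
    exps = [math.exp(l / tau) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

center = [0.0, 0.0, 0.0]   # running center, updated by EMA
m = 0.9                    # center momentum (illustrative value)

def teacher_distribution(logits, tau_t=0.04):
    """Centering (subtract c) then sharpening (low-temperature softmax)."""
    centered = [l - c for l, c in zip(logits, center)]
    return softmax(centered, tau_t)

def update_center(batch_logits):
    """c <- m*c + (1-m) * mean of teacher outputs over the batch."""
    global center
    dim = len(center)
    mean = [sum(x[k] for x in batch_logits) / len(batch_logits) for k in range(dim)]
    center = [m * c + (1 - m) * mu for c, mu in zip(center, mean)]

batch = [[2.0, 1.0, 0.1], [1.5, 0.5, 0.2]]  # toy teacher logits for a batch
update_center(batch)
p = teacher_distribution(batch[0])
```

Centering alone would drive the output toward uniform; sharpening alone would collapse it onto one dimension. Applying both keeps the distribution informative.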
Ablation Study
- EMA update for the teacher model is crucial.
- Cross-entropy loss performs better than MSE loss.
- Sinkhorn-Knopp improves results slightly.
- Multi-crop training is important.
- 8x8 patch size is better than 16x16 for ViT.
- The teacher model consistently outperforms the student model in validation for every epoch.
- This behavior is not observed in other momentum-based frameworks, nor when the teacher is built from the student's previous epoch.
- We interpret the momentum teacher in DINO as a form of Polyak-Ruppert averaging.
- Polyak-Ruppert averaging aims to improve model performance and stability by averaging historical parameters.
- Multi-crop can save computational resources and improve performance.