[Paper Note] Emerging Properties in Self-Supervised Vision Transformers

https://arxiv.org/abs/2104.14294

TL;DR

  • DINO introduces a self-supervised method that requires neither labels nor negative samples.
  • It pairs a student network with a teacher network of the same architecture; the teacher's parameters are an exponential moving average (EMA) of the student's.
  • The teacher sees only global views of an image, while the student also sees small local crops; the student is trained to output features that match the teacher's.
  • To prevent all images from collapsing onto the same feature, the teacher's outputs are centered and sharpened (each piece is sketched in the sections below).

Introduction

ViT requires significantly more computational resources and training data than convnets, yet shows no decisive advantage. What are its benefits?

Motivation: self-supervised pretraining has been key to Transformers' success in NLP. Is ViT's lack of a clear advantage due to its reliance on supervised pretraining?

Key Components

Related Work

Existing Self-Supervised Learning Methods for Images

Approach

In the method diagram, sg means stop-gradient: gradients flow only through the student; the teacher receives none and is instead updated by EMA.
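
A minimal sketch of the objective under these rules. The function name `dino_loss` and the default temperatures are illustrative assumptions; the key points are the `detach()` on the teacher branch (the stop-gradient) and the cross-entropy between the two softened distributions.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    # Teacher branch: stop-gradient (detach), centering, and a low
    # "sharpening" temperature tau_t (see "Avoiding Collapse" below).
    t = F.softmax((teacher_out.detach() - center) / tau_t, dim=-1)
    # Student branch: ordinary softmax at a higher temperature tau_s.
    log_s = F.log_softmax(student_out / tau_s, dim=-1)
    # Cross-entropy H(teacher, student), averaged over the batch.
    return -(t * log_s).sum(dim=-1).mean()
```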


Input Processing
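
A sketch of the multi-crop scheme from the TL;DR, assuming two large "global" crops and several small "local" crops per image; the exact sizes and scale ranges here are illustrative.

```python
import torchvision.transforms as T

# Assumed sizes/scales: global crops cover most of the image,
# local crops cover only a small patch.
global_crop = T.Compose([
    T.RandomResizedCrop(224, scale=(0.4, 1.0)),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
local_crop = T.Compose([
    T.RandomResizedCrop(96, scale=(0.05, 0.4)),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

def multi_crop(img, n_local=6):
    # The teacher receives only the global views; the student receives all.
    global_views = [global_crop(img) for _ in range(2)]
    local_views = [local_crop(img) for _ in range(n_local)]
    return global_views, local_views
```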

Updating the Teacher Network
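
Per the TL;DR, the teacher is never trained by gradient descent; its weights track an EMA of the student's. A minimal sketch (the momentum value is an assumption; in practice it is scheduled over training):

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # teacher <- momentum * teacher + (1 - momentum) * student
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
```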

Avoiding Collapse
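
Centering subtracts a running mean from the teacher's outputs (pushing predictions toward the uniform distribution), while sharpening is the low teacher temperature tau_t already shown in the loss sketch (pushing them toward a single peak); either alone leads to a collapse mode, so they are used together. A minimal sketch of the center update, with an assumed momentum:

```python
import torch

@torch.no_grad()
def update_center(center, teacher_out, momentum=0.9):
    # Running mean of the teacher's batch outputs; `center` is
    # subtracted from teacher outputs before the softmax.
    return center * momentum + teacher_out.mean(dim=0) * (1.0 - momentum)
```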

Ablation Study
