[Paper Note] Learning Transferable Visual Models From Natural Language Supervision

TL;DR

  • CLIP maps images and text into the same embedding space.
  • It is trained with contrastive learning: within each batch, the similarity between matching image-text feature pairs is maximized, while the similarity between mismatched pairs is minimized (see the sketch after this list).
  • Both the image and text encoders are trained from scratch. The text encoder uses masked (causal) self-attention, which leaves room for a language-modeling objective, but CLIP itself is trained with the contrastive loss only.
  • The output of the text encoder at the EOS token is used as the text feature.
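
Below is a minimal PyTorch sketch of this symmetric contrastive objective, in the spirit of the NumPy-style pseudocode in the paper. The function name, the fixed temperature value, and the assumption that image/text features are already computed are simplifications of mine, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired image/text features.

    Both inputs are (batch, dim); row i of each tensor comes from the same pair.
    """
    # L2-normalize so dot products become cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = image_features @ text_features.t() / temperature

    # The matching text for image i sits at column i (and vice versa).
    targets = torch.arange(logits.shape[0], device=logits.device)

    # Cross-entropy in both directions, then average.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

In the paper the temperature is not a fixed constant but a learned parameter (initialized to 0.07 and clipped so the logit scale never exceeds 100), and the features come from linear projections of the image and text encoder outputs.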

Previous Methods vs. Our Method

Previous methods rely on:

  • A fixed set of predetermined object categories, which restricts what the model can recognize.
  • Large crowd-labeled datasets such as ImageNet, so specifying any new visual concept requires collecting additional labeled data.

Our method offers a different approach:

  • Learn image representations directly from raw text: the pre-training task is to predict which caption goes with which image, over 400 million (image, text) pairs collected from the internet.
  • After pre-training, natural language is used to reference visual concepts, enabling zero-shot transfer to downstream tasks without dataset-specific training.

The effectiveness of our method can be validated through various tasks, including:

  • OCR
  • Action recognition in videos
  • Geo-localization
  • Fine-grained object classification
  • Zero-shot and linear-probe image classification on over 30 existing datasets

Background

Method

CLIP is trained with contrastive learning: each batch of N (image, text) pairs is encoded by the two encoders, and the model is trained to identify which of the N × N possible pairings actually occurred.

Objective
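
In notation of my own (not the paper's): for a batch of $N$ pairs with L2-normalized image embeddings $I_i$, text embeddings $T_j$, and a learned temperature $\tau$, the image-to-text half of the objective is a cross-entropy over cosine similarities:

$$
\mathcal{L}_{\text{img}\to\text{txt}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(I_i \cdot T_i / \tau)}{\sum_{j=1}^{N} \exp(I_i \cdot T_j / \tau)}
$$

The full loss averages this with the symmetric text-to-image term, so each matching pair is pulled together while the $N^2 - N$ mismatched pairs within the batch are pushed apart.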


Training

Text encoder
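
The text encoder is a Transformer that uses masked (causal) self-attention over the byte-pair-encoded caption; the activation at the EOS token is layer-normalized and linearly projected into the joint embedding space. Below is a rough sketch of that pooling step, assuming, as in the released CLIP code, that the EOS token has the largest ID in the vocabulary so its position can be found with an argmax; the function and tensor names are placeholders of mine.

```python
import torch

def pool_eos_feature(token_states: torch.Tensor,     # (batch, seq_len, width) transformer outputs
                     token_ids: torch.Tensor,        # (batch, seq_len) BPE token ids
                     text_projection: torch.Tensor,  # (width, embed_dim)
                     ) -> torch.Tensor:
    """Use the representation at the EOS position as the whole-caption feature.

    The final layer norm that CLIP applies before the projection is omitted here.
    """
    batch_idx = torch.arange(token_states.shape[0], device=token_states.device)
    # EOS is assumed to be the highest-id token, so argmax finds its position.
    eos_pos = token_ids.argmax(dim=-1)
    eos_states = token_states[batch_idx, eos_pos]  # (batch, width)
    # Project into the shared image-text embedding space.
    return eos_states @ text_projection            # (batch, embed_dim)
```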

Hyperparameters

Experiments