[Paper Note] Learning Transferable Visual Models From Natural Language Supervision
TL;DR
- CLIP maps images and text into the same embedding space.
- It is trained using contrastive learning. Within each batch, the similarity between correct image and text feature pairs is maximized, while the similarity between incorrect pairs is minimized.
- Both the image and text encoders are trained from scratch. The text encoder is a Transformer with masked (causal) self-attention; no auxiliary language-modeling loss is added.
- The output of the EOS token in the text encoder is used as the feature of the text.
Previous Methods vs. Our Method
Previous methods rely on:
- Predicting a fixed set of predetermined object categories.
- Limited supervision, which restricts the model’s ability to generalize.
- Classification-based approaches, where additional labeled data is needed to specify any other visual concept.
Our method offers a different approach:
- Learning state-of-the-art image representations using only image captions.
- Training on a large dataset of 400M image-text pairs.
- Improved generalization to downstream tasks.
The effectiveness of our method can be validated through various tasks, including:
- OCR
- Action recognition in videos
- Geo-localization
- Fine-grained object classification
Background
- In NLP, models learn from raw text without being limited to specific tasks, using a general prediction objective.
- Previous image-text methods lacked sufficient scale.
- Linear-probe representation-learning analysis trains a linear classifier on top of each model's frozen features and compares accuracy to judge which model learns the better representation (a minimal sketch follows this list).
- The main focus of this line of work is using language as a training signal.
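A minimal sketch of the linear-probe protocol, assuming features have already been extracted with a frozen encoder; the scikit-learn setup below is an illustration, not the paper's exact hyperparameter sweep.

```python
# Minimal linear-probe sketch: fit a linear classifier on frozen features.
# The regularization strength is left at its default here; in practice it is chosen by a validation sweep.
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(train_feats, train_labels, test_feats, test_labels):
    clf = LogisticRegression(max_iter=1000)   # L-BFGS logistic regression on top of frozen features
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels) # held-out accuracy used to compare representations
```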
Method
- State-of-the-art ImageNet models already require very large amounts of compute, so an efficient training approach is needed to scale up natural-language supervision.
using contrastive learning
- Jointly training an image CNN and a text transformer from scratch to predict the exact caption of an image proved hard to scale efficiently.
- Problem: predicting the exact words of a caption is difficult because a wide variety of texts can describe the same image.
- Therefore, a contrastive objective is used instead.
Objective
- Proxy task: predict which text as a whole is paired with which image.
- CLIP learns a multi-modal embedding space.
- It jointly trains an image encoder and a text encoder, maximizing the cosine similarity of the real (image, text) pairs within a batch and minimizing the similarity of the incorrect pairings.
- We optimize a symmetric cross-entropy loss over these similarity scores (see the sketch below).
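A minimal PyTorch sketch of this symmetric objective; tensor names and shapes are assumptions, and the encoders and projections that produce the features are omitted.

```python
# Sketch of the symmetric contrastive loss over an N x N similarity matrix (assumed shapes/names).
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, logit_scale):
    # image_features, text_features: [N, d] projected features for one batch
    # logit_scale: scalar = exp(learnable log-temperature)
    image_features = F.normalize(image_features, dim=-1)  # unit norm -> dot product = cosine similarity
    text_features = F.normalize(text_features, dim=-1)
    logits = logit_scale * image_features @ text_features.t()    # [N, N] pairwise similarities
    labels = torch.arange(logits.size(0), device=logits.device)  # correct pairs lie on the diagonal
    loss_images = F.cross_entropy(logits, labels)     # pick the right text for each image
    loss_texts = F.cross_entropy(logits.t(), labels)  # pick the right image for each text
    return (loss_images + loss_texts) / 2             # symmetric cross-entropy
```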
Training
- Trained from scratch: the image encoder is not initialized with ImageNet weights and the text encoder is not initialized from a pre-trained language model.
- Only a linear projection is used to map each encoder's representation into the contrastive embedding space.
- Replacing it with a non-linear projection makes no noticeable difference; non-linear projections are mainly used in image-only self-supervised learning.
- A random square crop from resized images is the only data augmentation used during training.
- The CLIP model treats the temperature parameter τ in the softmax function as a learnable parameter and parameterizes it in logarithmic form (i.e., optimizes log(τ)), avoiding manual hyperparameter tuning.
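A short sketch of the log-parameterized temperature; the initialization to 0.07 and the clamp at 100 follow the paper's description, while the module and attribute names are made up for illustration.

```python
# Sketch of the learnable temperature, optimized in log space (names are assumptions).
import math
import torch
import torch.nn as nn

class LogitScale(nn.Module):
    def __init__(self):
        super().__init__()
        # tau is initialized to 0.07; we store and optimize log(1 / tau) instead of tau itself.
        self.log_scale = nn.Parameter(torch.tensor(math.log(1 / 0.07)))

    def forward(self):
        # Exponentiate back and clamp so the logits are never scaled by more than 100.
        return self.log_scale.exp().clamp(max=100.0)
```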
Text encoder
- For text embedding, the text sequence is bracketed with [SOS] and [EOS] tokens. The activations of the highest transformer layer at the [EOS] token are taken as the text representation, layer-normalized, and then linearly projected into the multi-modal embedding space (sketched below).
- The text encoder uses masked (causal) self-attention; an auxiliary language-modeling objective is left as a possibility but is not used.
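A sketch of how the [EOS] activation could be pooled and projected; `ln_final`, `text_projection`, and the assumption of exactly one [EOS] per sequence are illustrative, not the exact implementation.

```python
# Sketch of pooling the text feature at the [EOS] token (variable names are assumptions).
import torch

def pool_text_features(hidden_states, token_ids, eos_token_id, ln_final, text_projection):
    # hidden_states:   [batch, seq_len, width] activations of the top transformer layer
    # token_ids:       [batch, seq_len] input ids, used only to locate the [EOS] position
    # ln_final:        a LayerNorm module; text_projection: [width, embed_dim] matrix
    # Assumes exactly one [EOS] token per sequence.
    eos_pos = (token_ids == eos_token_id).int().argmax(dim=-1)             # [EOS] position per row
    pooled = hidden_states[torch.arange(hidden_states.size(0)), eos_pos]   # [batch, width]
    pooled = ln_final(pooled)            # layer norm on the [EOS] activation
    return pooled @ text_projection      # linear projection into the multi-modal space
```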
hyperparameters
- Training parameters: 32 epochs, Adam optimizer, Cosine learning rate schedule.
- batch size = 32768
- Input resolution is 224-pixel square images; the best ViT-L/14 model is additionally fine-tuned for one extra epoch at 336-pixel resolution (ViT-L/14@336px).
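A hedged sketch of this optimizer setup (Adam with decoupled weight decay plus a cosine learning-rate schedule); the learning rate, weight decay, and step count below are placeholders, not the paper's values.

```python
# Sketch of the optimizer/schedule implied above (lr and weight_decay are placeholder assumptions).
import torch

def build_optimizer(model, total_steps):
    # Adam with decoupled weight decay (AdamW).
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.2)
    # Cosine decay of the learning rate over the whole training run.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
    return optimizer, scheduler
```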
Experiments
- ViT image encoders are more compute-efficient than the ResNet variants; the best model is a ViT.
- CLIP can classify datasets it has never trained on zero-shot, by embedding the class names (or prompts built from them) as text.
- Adding more context to the prompt can increase classification accuracy, e.g. "a photo of a {label}, a type of pet." for the pets dataset (see the sketch at the end of this section).
- Linear probe is used to evaluate the quality of the learned representations.
- Zero-shot CLIP is much more robust to distribution shift than standard ImageNet models.
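A sketch of zero-shot classification with a prompt template like the one above; `encode_text` and `tokenize` are hypothetical stand-ins for the model's text pipeline, and the pet-specific template is just one example.

```python
# Sketch of zero-shot classification via prompt templates (encode_text/tokenize are assumed callables).
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_features, class_names, encode_text, tokenize):
    # Build one prompt per class, e.g. "a photo of a beagle, a type of pet."
    prompts = [f"a photo of a {name}, a type of pet." for name in class_names]
    text_features = F.normalize(encode_text(tokenize(prompts)), dim=-1)  # [num_classes, d]
    image_features = F.normalize(image_features, dim=-1)                 # [batch, d]
    logits = 100.0 * image_features @ text_features.t()                  # scaled cosine similarities
    return logits.argmax(dim=-1)                                         # predicted class per image
```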