[Paper Note] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models


Vision-Language Pre-training

Given the existence of powerful large language models (LLMs) and image encoders, a natural next step is to combine them for vision-language pre-training.

To prevent catastrophic forgetting (and to keep the compute cost manageable), the pre-trained image encoder and LLM are kept frozen, which makes bridging the gap between the two modalities the main training difficulty.
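As a minimal PyTorch sketch of what "frozen" means in practice (OPT-2.7B is one of the LLMs used in the paper, but this loading code is only an illustration, not the official implementation):

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative frozen LLM; BLIP-2 experiments with OPT and FlanT5 variants.
llm = AutoModelForCausalLM.from_pretrained("facebook/opt-2.7b")

# Freeze every LLM parameter so gradients only flow into the new bridging module.
for param in llm.parameters():
    param.requires_grad = False
llm.eval()  # also disables dropout in the frozen model
```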

Method

The training is divided into two stages:

1. Vision-language representation learning: the Q-Former is connected to the frozen image encoder and learns to extract the visual representation most relevant to the text.
2. Vision-to-language generative learning: the Q-Former's output queries are fed into the frozen LLM, training the queries to produce a visual representation the LLM can interpret.

[Figure 2 from the paper]

Q-Former Architecture

The Q-Former is a lightweight transformer that uses a fixed set of learnable query embeddings (32 queries in the paper) to interact with the frozen image encoder's output features through cross-attention layers. This bottleneck architecture, combined with the pre-training objectives, forces the queries to extract the visual information most relevant to the text.
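A rough PyTorch sketch of the query-bottleneck idea (module layout, dimensions, and names are my own simplifications, not the official Q-Former implementation): a small set of learnable query embeddings cross-attends to the frozen image features, so everything downstream sees about the image must pass through those few query vectors.

```python
import torch
import torch.nn as nn

class QueryBottleneck(nn.Module):
    """Toy stand-in for a single Q-Former cross-attention block (not the official code)."""

    def __init__(self, num_queries=32, dim=768, num_heads=12):
        super().__init__()
        # Learnable queries: the only pathway from the frozen image features onward.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, image_feats):
        # image_feats: (B, num_patches, dim), output of the frozen image encoder
        q = self.queries.expand(image_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(q, image_feats, image_feats)
        return attended + self.ffn(attended)  # (B, num_queries, dim): compressed visual tokens


bottleneck = QueryBottleneck()
visual_tokens = bottleneck(torch.randn(2, 257, 768))  # dummy frozen image features
print(visual_tokens.shape)  # torch.Size([2, 32, 768])
```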

Three Training Objectives

The Q-Former is optimized simultaneously on three training objectives, which share the same model parameters but use different attention masks between the queries and the text:

- Image-Text Contrastive learning (ITC): aligns the query representations with the text representation.
- Image-grounded Text Generation (ITG): trains the model to generate the caption conditioned on the query outputs.
- Image-Text Matching (ITM): a binary classification of whether an image-text pair is matched or unmatched.
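A schematic of how the three losses combine into one update (the `contrastive_loss`, `generation_loss`, and `matching_loss` methods are hypothetical placeholders for the ITC, ITG, and ITM heads, not real API calls):

```python
import torch

def training_step(model, batch, optimizer):
    """One joint update over the three stage-1 objectives (schematic, not the official code)."""
    image_feats, text_tokens = batch["image_feats"], batch["text_tokens"]

    # Hypothetical per-objective methods; in the paper each objective reuses the same
    # Q-Former weights but applies a different query/text attention mask.
    loss_itc = model.contrastive_loss(image_feats, text_tokens)  # ITC: align query vs. text embeddings
    loss_itg = model.generation_loss(image_feats, text_tokens)   # ITG: caption conditioned on the queries
    loss_itm = model.matching_loss(image_feats, text_tokens)     # ITM: matched / unmatched classification

    loss = loss_itc + loss_itg + loss_itm  # the three objectives are optimized simultaneously
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```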

The authors remove the last layer of the Vision Transformer (ViT) and use the output features of the second-to-last layer, which gives slightly better performance.
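A small sketch of one way to grab the second-to-last layer's output with a forward hook (assumes a timm-style ViT that exposes its transformer blocks as `vit.blocks`; this is my illustration, not the BLIP-2 feature-extraction code):

```python
import torch

def penultimate_features(vit, images):
    """Return the output of the second-to-last transformer block of a ViT.

    Assumes a timm-style model exposing `vit.blocks` as an nn.ModuleList;
    illustration only, not the official BLIP-2 code.
    """
    captured = {}
    handle = vit.blocks[-2].register_forward_hook(
        lambda module, inputs, output: captured.update(feats=output)
    )
    with torch.no_grad():
        vit(images)
    handle.remove()
    return captured["feats"]  # (B, 1 + num_patches, dim), before the final block and norm
```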

References

Li, J., Li, D., Savarese, S., & Hoi, S. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. ICML 2023. arXiv:2301.12597.