[Paper Note] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Open-set object detection is the task of detecting arbitrary objects described by a text prompt, including categories that were never seen during training, rather than only those from a predefined set. For example, a prompt could ask for a lion's ear. The key to generalizing to unseen objects is bringing language into the detector.

Most existing open-set detectors extend closed-set detectors to open-set scenarios by incorporating language information.

[Figure 2: extending closed-set detectors to open-set scenarios]

Previous Works

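Feature fusion between image and text can happen in three phases of a detector: the neck, query initialization, and the head. It was difficult for previous CNN-based methods to fuse features in all three phases. In contrast, Transformer-based detectors can easily integrate multi-modal information at each of them.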

CLIP, for instance, is trained on whole-image and text pairs, so it does not perform well on region-text pair tasks.

GLIP introduced a new approach by incorporating contrastive training between object regions and language phrases on large-scale data.
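
To make the region-phrase idea concrete, here is a minimal sketch of the alignment scores such contrastive training operates on: each region feature is scored against each text token feature with a dot product. The function name `alignment_scores`, the two-dimensional features, and the token list are all made up for illustration; this is not GLIP's actual code.

```python
def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def alignment_scores(region_feats, token_feats):
    """Score every region against every text token (rows: regions, cols: tokens)."""
    return [[dot(r, t) for t in token_feats] for r in region_feats]


# Hypothetical toy features: 2 regions, 3 tokens (say "cat", "dog", ".").
region_feats = [[1.0, 0.0], [0.0, 1.0]]
token_feats = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

scores = alignment_scores(region_feats, token_feats)

# Index of the highest-scoring token for each region.
best_token = [max(range(len(row)), key=row.__getitem__) for row in scores]
print(best_token)
```

During training, the scores for matched region-phrase pairs are pushed up and the others pushed down; at inference, the highest-scoring tokens give each box its label.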

Limitations of GLIP:

Improvements made by Grounding DINO:

This work also extends the evaluation of open-set detection to Referring Expression Comprehension (REC), a scenario often overlooked by previous methods.

Method

The model takes an (image, text) pair as input and outputs multiple pairs of object boxes and noun phrases. For Referring Expression Comprehension (REC), each text input refers to a single object, so there is only one bounding box per text input.

Loss Function

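Bipartite Matching: How do we map the outputs of 900 queries to a small number of ground truth bounding boxes? As in DETR-style detectors, each ground truth box is assigned to exactly one query by minimizing a total matching cost, and the remaining queries are treated as "no object".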
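
A minimal sketch of this matching step, under stated assumptions: the cost used here (one minus a match score, plus an L1 box distance, with equal weights) and all names (`match_queries_to_targets`, `pred_scores`, and so on) are illustrative and not taken from the paper. The brute-force search stands in for the Hungarian algorithm that DETR-style detectors use in practice, and is only feasible at toy sizes, not for 900 queries.

```python
from itertools import permutations


def l1_box_cost(a, b):
    """L1 distance between two boxes given as (cx, cy, w, h)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match_queries_to_targets(pred_boxes, pred_scores, gt_boxes,
                             w_cls=1.0, w_box=1.0):
    """Return {gt_index: query_index} with minimal total cost.

    pred_scores[q][g] is an assumed score in [0, 1] that query q
    matches the phrase of ground truth g. Weights are illustrative.
    """
    num_q, num_gt = len(pred_boxes), len(gt_boxes)

    def cost(g, q):
        cls_cost = 1.0 - pred_scores[q][g]
        box_cost = l1_box_cost(pred_boxes[q], gt_boxes[g])
        return w_cls * cls_cost + w_box * box_cost

    best_total, best_assign = float("inf"), None
    # Try every way of giving each ground truth its own distinct query.
    for queries in permutations(range(num_q), num_gt):
        total = sum(cost(g, q) for g, q in enumerate(queries))
        if total < best_total:
            best_total, best_assign = total, queries
    return dict(enumerate(best_assign))


# Toy example: 4 queries, 2 ground truth boxes (normalized cx, cy, w, h).
pred_boxes = [
    (0.50, 0.50, 0.20, 0.20),
    (0.10, 0.10, 0.05, 0.05),
    (0.80, 0.30, 0.10, 0.30),
    (0.52, 0.48, 0.22, 0.18),
]
pred_scores = [[0.6, 0.1], [0.1, 0.1], [0.1, 0.9], [0.9, 0.1]]
gt_boxes = [(0.52, 0.49, 0.21, 0.19), (0.80, 0.31, 0.10, 0.28)]

matches = match_queries_to_targets(pred_boxes, pred_scores, gt_boxes)
# Queries that received no ground truth are supervised as "no object".
unmatched = sorted(set(range(len(pred_boxes))) - set(matches.values()))
print(matches, unmatched)
```

Only matched queries contribute to the box losses. Because every ground truth box gets exactly one query, the model is discouraged from emitting duplicate boxes for the same object.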

Experiments

The model was evaluated across various tasks:

Model Details