[Paper Note] DETR: End-to-End Object Detection with Transformers

https://arxiv.org/abs/2005.12872

Traditional object detection methods usually rely on hand-designed post-processing steps to produce the final bounding boxes. For example, non-maximum suppression (NMS) is commonly used to remove duplicate boxes.

Furthermore, traditional methods often rely on an initial guess, such as anchors, a grid of possible object centers, or region proposals.

This paper introduces a transformer encoder-decoder architecture that directly predicts boxes and classes in an end-to-end fashion.

To predict a set of objects without duplicates, the model needs a way for different predictions to communicate. This is achieved through the attention mechanism. In previous methods, predictions could not communicate with each other, which is why NMS was necessary to eliminate duplicates.
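As a minimal sketch of this idea (assumed sizes, not the paper's code), the "communication" between predictions is ordinary self-attention over the N output slots:

```python
import torch
import torch.nn as nn

# A minimal sketch (assumed sizes, not the paper's code): self-attention over
# the N prediction slots lets every slot see the others, which is what makes
# duplicate suppression learnable without NMS.
N, d_model = 100, 256                      # 100 queries, 256-dim embeddings (typical DETR sizes)
queries = torch.randn(N, 1, d_model)       # (sequence, batch, dim) layout for nn.MultiheadAttention

self_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8)
updated, attn_weights = self_attn(queries, queries, queries)

# attn_weights[0, i, j] is how much prediction slot i attends to slot j,
# i.e. the channel through which predictions "communicate".
print(updated.shape, attn_weights.shape)   # torch.Size([100, 1, 256]) torch.Size([1, 100, 100])
```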

Method

Two ingredients are at the core of DETR: a set-prediction loss that enforces a unique, one-to-one matching between predicted and ground-truth objects, and an architecture that predicts the full set of objects in a single pass while modeling the relations between them.

Loss Function

To achieve single-pass set prediction, the decoder takes N query tokens as input, where N is set to be significantly larger than the typical number of objects in an image. The ground truth is padded with a ∅ (no object) class so that it also contains N entries.

The Hungarian algorithm is used to find the globally optimal (minimum-cost) matching between the N predictions and the N ground-truth entries. The matching cost considers both the class prediction and the box coordinates:

$$
L_{\text{match}}(y_i, \hat{y}_{\sigma(i)}) = -\,\mathbf{1}_{\{c_i \neq \emptyset\}}\,\hat{p}_{\sigma(i)}(c_i) + \mathbf{1}_{\{c_i \neq \emptyset\}}\,L_{\text{box}}(b_i, \hat{b}_{\sigma(i)})
$$
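As a rough illustration of this matching step (hypothetical shapes and names, using SciPy's linear_sum_assignment rather than the authors' implementation, and with only an L1 box term instead of the full L_box):

```python
import torch
from scipy.optimize import linear_sum_assignment

# Hypothetical sizes for illustration: N predictions, M real objects, C classes
# (C + 1 with the "no object" class); values are random stand-ins.
N, M, C = 100, 5, 91
pred_logits = torch.randn(N, C + 1)          # class logits per query (last index = "no object")
pred_boxes = torch.rand(N, 4)                # (cx, cy, w, h), normalized
gt_classes = torch.randint(0, C, (M,))       # ground-truth labels (real objects only)
gt_boxes = torch.rand(M, 4)

prob = pred_logits.softmax(-1)
cost_class = -prob[:, gt_classes]                    # -p_hat_{sigma(i)}(c_i), shape (N, M)
cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)    # L1 box distance, shape (N, M)
cost = cost_class + cost_box

# Hungarian algorithm: globally optimal one-to-one assignment of the M objects
# to M of the N predictions; the remaining predictions count as "no object".
pred_idx, gt_idx = linear_sum_assignment(cost.numpy())
print(list(zip(pred_idx.tolist(), gt_idx.tolist())))
```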

Architecture

A CNN backbone extracts a feature map from the image. The features are flattened, combined with positional encodings, and passed through a transformer encoder. The decoder takes the N learned object queries as input and attends to the encoder output; each resulting output embedding is decoded independently by a shared feed-forward network (FFN) into a class label and box coordinates.
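A compact sketch of this pipeline (assumed sizes: hidden dimension 256, 100 object queries, 6 encoder/decoder layers; inference only, randomly initialized, and using a simple learned positional encoding rather than the paper's):

```python
import torch
import torch.nn as nn
import torchvision

# A minimal DETR-style forward pass with assumed sizes; no loss, no training.
class MiniDETR(nn.Module):
    def __init__(self, num_classes, hidden_dim=256, num_queries=100):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)             # random weights, sketch only
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])   # drop avgpool + fc
        self.input_proj = nn.Conv2d(2048, hidden_dim, kernel_size=1)     # project CNN features
        self.transformer = nn.Transformer(hidden_dim, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6)
        self.query_embed = nn.Parameter(torch.rand(num_queries, hidden_dim))  # learned object queries
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))    # simple learned 2D pos. enc.
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)          # +1 for "no object"
        self.box_head = nn.Linear(hidden_dim, 4)                          # (cx, cy, w, h)

    def forward(self, images):
        feats = self.input_proj(self.backbone(images))                    # (B, 256, H, W)
        B, C, H, W = feats.shape
        pos = torch.cat([self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
                         self.row_embed[:H].unsqueeze(1).repeat(1, W, 1)],
                        dim=-1).flatten(0, 1).unsqueeze(1)                # (H*W, 1, 256)
        src = feats.flatten(2).permute(2, 0, 1) + pos                     # (H*W, B, 256)
        tgt = self.query_embed.unsqueeze(1).repeat(1, B, 1)               # (100, B, 256)
        hs = self.transformer(src, tgt)                                   # (100, B, 256)
        return self.class_head(hs), self.box_head(hs).sigmoid()

logits, boxes = MiniDETR(num_classes=91)(torch.randn(1, 3, 800, 800))
print(logits.shape, boxes.shape)   # torch.Size([100, 1, 92]) torch.Size([100, 1, 4])
```

The only detection-specific pieces are the learned queries and the two small prediction heads; everything else is a standard CNN backbone plus a standard transformer.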

Auxiliary Decoding Losses
Prediction FFNs and Hungarian loss are applied after every decoder layer. All these prediction FFNs share the same weights.
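A small sketch of this weight sharing (assumed shapes and hypothetical names):

```python
import torch
import torch.nn as nn

# A small sketch with assumed shapes: the same class/box heads are reused on
# the output of every decoder layer, and the Hungarian loss would be computed
# for each layer's predictions (only the last layer is used at inference).
num_layers, N, B, d, num_classes = 6, 100, 2, 256, 91
decoder_outputs = [torch.randn(N, B, d) for _ in range(num_layers)]  # one tensor per decoder layer

class_head = nn.Linear(d, num_classes + 1)   # shared weights across all layers (+1 for "no object")
box_head = nn.Linear(d, 4)                   # shared weights across all layers

aux_predictions = [(class_head(h), box_head(h).sigmoid()) for h in decoder_outputs]
print(len(aux_predictions), aux_predictions[0][0].shape)  # 6 torch.Size([100, 2, 92])
```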

Experiment

The model was trained and evaluated on the COCO 2017 dataset.

DETR requires a very long training time. Compared to Faster R-CNN, DETR shows a clear advantage in detecting large objects but performs worse on small objects.

Ablation Study

Why NMS is not needed: self-attention over the output embeddings in the decoder lets the predictions communicate and suppress duplicates on their own. In the ablation, adding NMS only helps the predictions from the first decoder layer and brings essentially no gain at the final layer.

Query analysis: visualizing the boxes predicted by each of the N query slots shows that each slot learns to specialize in particular image regions and box sizes.

The model generalizes to out-of-distribution cases. For example, even though no training image contains more than 13 instances of the same class (giraffes, in the paper's example), DETR can still detect all instances in a synthetic image containing many more of them at test time.

Panoptic Segmentation

DETR can be easily extended to panoptic segmentation.

The box embeddings (one per detected object) and the encoder output are passed to a multi-head attention layer, which produces a low-resolution attention map for each object. These maps are upsampled towards the input resolution by a CNN, giving a binary mask logit map per detected object. Finally, a pixel-wise argmax over these maps merges them into the final panoptic segmentation.
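A rough sketch of this mask head and the merging step (assumed shapes; plain bilinear upsampling stands in for the paper's FPN-style CNN):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed shapes, not the paper's mask head: object embeddings attend over the
# encoder feature map to get one low-resolution attention map per object; the
# final panoptic map is a pixel-wise argmax over the per-object mask logits.
B, d, H, W, num_slots = 1, 256, 25, 25, 100
obj_embed = torch.randn(num_slots, B, d)              # decoder output embeddings (one per query)
enc_feats = torch.randn(H * W, B, d)                  # flattened encoder output

attn = nn.MultiheadAttention(embed_dim=d, num_heads=8)
_, attn_maps = attn(obj_embed, enc_feats, enc_feats)  # (B, num_slots, H*W)
attn_maps = attn_maps.view(B, num_slots, H, W)        # low-resolution map per object

# Stand-in for the upsampling CNN: plain bilinear interpolation to image size.
mask_logits = F.interpolate(attn_maps, size=(800, 800), mode="bilinear", align_corners=False)
panoptic_map = mask_logits.argmax(dim=1)              # (B, 800, 800): pixel-wise argmax merges masks
print(panoptic_map.shape)
```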