DETR (DEtection TRansformer) is an approach to object detection built on a Transformer encoder-decoder architecture. It predicts a fixed number of boxes in parallel, in a single pass over the image.
A CNN backbone first encodes the input image into a feature map. These features are then collapsed into a sequence (with positional encodings added) and processed by a transformer. The encoder refines the flattened features with self-attention, and the decoder transforms a set of learned positional embeddings (called object queries) into output embeddings by cross-attending to the encoder output. Each output embedding is then independently decoded into box coordinates and a class label.
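To make the dataflow concrete, here is a minimal sketch in the spirit of the simplified PyTorch implementation in the DETR paper. The backbone choice, the 100 object queries, and the layer sizes are illustrative defaults, not tuned values:

```python
import torch
from torch import nn
from torchvision.models import resnet50

class MinimalDETR(nn.Module):
    """Minimal DETR-style model: CNN backbone + transformer + prediction heads."""

    def __init__(self, num_classes, hidden_dim=256, nheads=8,
                 num_encoder_layers=6, num_decoder_layers=6, num_queries=100):
        super().__init__()
        # CNN backbone: ResNet-50 with the pooling and classification head removed
        backbone = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        # 1x1 conv projects the 2048 backbone channels down to the transformer width
        self.conv = nn.Conv2d(2048, hidden_dim, 1)
        self.transformer = nn.Transformer(hidden_dim, nheads,
                                          num_encoder_layers, num_decoder_layers)
        # Learned object queries: one output slot (and thus one box) per query
        self.query_embed = nn.Parameter(torch.rand(num_queries, hidden_dim))
        # Learned 2D positional encodings for the flattened feature map
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        # Independent heads: class logits (+1 for "no object") and box coordinates
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)
        self.bbox_head = nn.Linear(hidden_dim, 4)

    def forward(self, x):
        h = self.conv(self.backbone(x))            # (B, hidden_dim, H, W)
        B, C, H, W = h.shape
        # Build a (H*W, 1, hidden_dim) positional encoding from row/col embeddings
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)
        # Collapse the spatial dimensions into a sequence and add positions
        src = pos + h.flatten(2).permute(2, 0, 1)   # (H*W, B, hidden_dim)
        tgt = self.query_embed.unsqueeze(1).repeat(1, B, 1)
        out = self.transformer(src, tgt)            # (num_queries, B, hidden_dim)
        # Each output embedding is decoded independently into class and box
        return self.class_head(out), self.bbox_head(out).sigmoid()
```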
One key design choice in DETR is that the number of predictions is fixed by the number of object queries, regardless of image content, so it is set much larger than the typical number of objects and most predictions must output the “no object” class. For training, DETR first searches for the permutation $\hat{\sigma}$ of the predictions that minimizes the matching loss between ground truths and predictions:

$$\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_N}{\arg\min} \sum_{i=1}^{N} \mathcal{L}_{\text{match}}\big(y_i, \hat{y}_{\sigma(i)}\big)$$

where the ground-truth set $y_1, \dots, y_N$ is padded with “no object” entries.
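In the paper, this optimal bipartite matching is computed with the Hungarian algorithm. Below is a sketch of the matching step under a simplified pairwise cost (class probability plus L1 box distance); the paper's cost also adds a generalized-IoU term, and the `hungarian_match` name and 5.0 weight here are illustrative:

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes):
    """Match N predictions to M ground-truth objects (N >= M) for one image.

    pred_probs: (N, num_classes + 1) softmax class probabilities
    pred_boxes: (N, 4) predicted boxes
    gt_labels:  (M,) ground-truth class indices
    gt_boxes:   (M, 4) ground-truth boxes
    """
    # Pairwise matching cost: low when a prediction assigns high
    # probability to the true class and its box is close to the target.
    cost_class = -pred_probs[:, gt_labels]              # (N, M)
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)  # (N, M)
    cost = cost_class + 5.0 * cost_bbox                 # 5.0: illustrative weight
    # Hungarian algorithm on the rectangular cost matrix; predictions
    # left unmatched are supervised as "no object" in the loss.
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)
```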
Using this ordering, DETR computes the Hungarian loss over all predictions: a negative log-likelihood term for the class predictions (matched predictions supervised by their assigned labels, unmatched ones by “no object”) plus a box regression loss, a linear combination of L1 and generalized IoU losses, over the matched pairs.
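A matching sketch of this loss, reusing the hypothetical `hungarian_match` output above and again simplifying the box term to plain L1:

```python
import torch
import torch.nn.functional as F

def hungarian_loss(logits, boxes, gt_labels, gt_boxes, pred_idx, gt_idx,
                   no_object_class):
    """Hungarian loss for one image, given the optimal matching."""
    # Every prediction gets a class target: its matched ground-truth
    # label, or "no object" if it was left unmatched.
    targets = torch.full((logits.shape[0],), no_object_class,
                         dtype=torch.long)
    targets[pred_idx] = gt_labels[gt_idx]
    cls_loss = F.cross_entropy(logits, targets)
    # Box regression is only supervised on matched pairs; the paper's
    # L1 + generalized-IoU combination is reduced to L1 here.
    box_loss = F.l1_loss(boxes[pred_idx], gt_boxes[gt_idx])
    return cls_loss + 5.0 * box_loss  # 5.0: illustrative weight
```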
Empirically, DETR performs on par with Faster R-CNN, with an advantage in detecting larger objects enabled by the global context of the transformer's self-attention.