DETR (DEtection TRansformer) is an approach to object detection built on a 🦾 Transformer encoder-decoder architecture. It outputs a fixed-size set of predictions and is trained with a bipartite matching loss.

A CNN backbone first encodes the input image into a feature map. We then collapse the spatial dimensions of the features (and add positional encodings) and process them with a transformer. The encoder refines these features via self-attention; the decoder transforms a fixed set of learned positional embeddings (called object queries) into output embeddings by cross-attending to the encoder's output. Each output embedding is then independently decoded into box coordinates and a class label.
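Concretely, the whole pipeline fits in a short PyTorch sketch. The model below is a hypothetical, simplified version in the spirit of the demo in the DETR paper's appendix (illustrative hyperparameters, fixed-size learned positional encodings, no training code); it shows the backbone, the spatial collapse with positional encodings, the object queries, and the two prediction heads.

```python
import torch
from torch import nn
from torchvision.models import resnet50

class MinimalDETR(nn.Module):
    """Simplified DETR sketch; names and sizes are illustrative."""

    def __init__(self, num_classes, hidden_dim=256, nheads=8,
                 num_encoder_layers=6, num_decoder_layers=6, num_queries=100):
        super().__init__()
        # CNN backbone: ResNet-50 up to the final conv feature map
        self.backbone = nn.Sequential(*list(resnet50().children())[:-2])
        self.conv = nn.Conv2d(2048, hidden_dim, 1)  # project to transformer width
        self.transformer = nn.Transformer(hidden_dim, nheads,
                                          num_encoder_layers, num_decoder_layers)
        # learned object queries (decoder input) and 2D positional encodings
        self.query_pos = nn.Parameter(torch.rand(num_queries, hidden_dim))
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        # independent heads: class logits (incl. "no object") and box coordinates
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
        self.linear_bbox = nn.Linear(hidden_dim, 4)

    def forward(self, x):
        h = self.conv(self.backbone(x))  # (B, d, H, W) feature map
        B, d, H, W = h.shape
        # build an (H*W, 1, d) grid of positional encodings
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)
        # collapse spatial dims into a sequence: (H*W, B, d)
        src = pos + h.flatten(2).permute(2, 0, 1)
        # object queries, one per prediction slot: (num_queries, B, d)
        tgt = self.query_pos.unsqueeze(1).repeat(1, B, 1)
        out = self.transformer(src, tgt)  # (num_queries, B, d)
        # decode each output embedding independently
        return self.linear_class(out), self.linear_bbox(out).sigmoid()
```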

One key design of DETR is that the number of predicted boxes is fixed at $N$ (one per object query), so many predictions must fall into the "no object" class $\varnothing$. For predictions $\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$, we compare them with the ground truth $y$ (padded with $\varnothing$ to size $N$) via bipartite matching; that is, we find the permutation

$$\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i=1}^{N} \mathcal{L}_{\mathrm{match}}(y_i, \hat{y}_{\sigma(i)})$$

that minimizes the matching loss between a ground truth $y_i = (c_i, b_i)$ and a prediction:

$$\mathcal{L}_{\mathrm{match}}(y_i, \hat{y}_{\sigma(i)}) = -\mathbb{1}_{\{c_i \neq \varnothing\}}\, \hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}(b_i, \hat{b}_{\sigma(i)})$$
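In practice, the minimizing permutation is computed with the Hungarian algorithm (DETR uses scipy's `linear_sum_assignment`). Below is a minimal sketch of a matcher, assuming class logits and normalized boxes as inputs: it keeps only the class-probability and $\ell_1$ terms of $\mathcal{L}_{\mathrm{match}}$ (the full matcher adds a generalized IoU term), and since every unmatched prediction pays the same constant cost for $\varnothing$, it matches directly against the $M$ real objects instead of padding to $N$.

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_classes, gt_boxes, l1_weight=5.0):
    """Match N predictions to M ground-truth objects (sketch; `l1_weight`
    is an illustrative value). Returns matched (prediction, ground-truth)
    index pairs."""
    probs = pred_logits.softmax(-1)                     # (N, num_classes + 1)
    cost_class = -probs[:, gt_classes]                  # (N, M) class term
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)  # (N, M) L1 box term
    cost = cost_class + l1_weight * cost_bbox
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)
```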

Using this ordering $\hat{\sigma}$, we optimize our model with the Hungarian loss

$$\mathcal{L}_{\mathrm{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}(b_i, \hat{b}_{\hat{\sigma}(i)}) \right]$$

where $\mathcal{L}_{\mathrm{box}}$ is a weighted combination of the $\ell_1$ loss and the generalized IoU loss on the box coordinates.
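A matching sketch of the Hungarian loss, using the index pairs from the matcher above: every prediction slot gets a cross-entropy target ($\varnothing$ unless matched), while the box term applies only to matched pairs. The fixed weight, the omitted generalized IoU term, and the omitted down-weighting of the $\varnothing$ class are simplifications.

```python
import torch
import torch.nn.functional as F

def hungarian_loss(pred_logits, pred_boxes, gt_classes, gt_boxes,
                   pred_idx, gt_idx, no_object_class, l1_weight=5.0):
    """Sketch of the Hungarian loss (omits the generalized IoU term and
    DETR's down-weighting of the "no object" class)."""
    N = pred_logits.shape[0]
    # classification: default every slot to "no object", then fill in matches
    targets = torch.full((N,), no_object_class, dtype=torch.long)
    targets[pred_idx] = gt_classes[gt_idx]
    loss_class = F.cross_entropy(pred_logits, targets)
    # box regression: only matched pairs contribute
    loss_box = F.l1_loss(pred_boxes[pred_idx], gt_boxes[gt_idx])
    return loss_class + l1_weight * loss_box
```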

Empirically, DETR performs on par with 👟 Faster R-CNN on COCO, with an advantage on large objects (attributed to the global context of the transformer's self-attention) but weaker accuracy on small objects.