YOLO (You Only Look Once) performs multi-object detection with a single pass through a Convolutional Neural Network.

We divide the input image into an $S \times S$ grid, and each grid cell is responsible for predicting bounding boxes for objects whose center falls in that cell. A single cell is limited to predicting a fixed number $B$ of boxes and one class shared by all bounding boxes predicted by the cell.

Architecture

The architecture begins with a DarkNet backbone that extracts features from the image. It then flattens the result and feeds it into fully connected (FC) layers.

The final output has shape $S \times S \times (C + 5B)$, where $S$ is the size of the grid, $C$ is the number of classes, and $B$ is the number of boxes per cell.

Each cell is in charge of $C + 5B$ values: probabilities for the $C$ classes, then confidence, x, y, width, and height for each of the $B$ bounding boxes.
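As a rough illustration of this layout, the sketch below slices one cell's vector into its class probabilities and per-box values. The names `S`, `B`, `C`, and `split_cell` are hypothetical, and the values 7, 2, and 20 are the ones used in the original YOLO paper rather than anything fixed by this note.

```python
import numpy as np

# Assumed values (from the original YOLO paper): grid size, boxes per cell, classes.
S, B, C = 7, 2, 20

def split_cell(cell, num_boxes=B, num_classes=C):
    """Split one cell's flat vector into class probabilities and per-box values."""
    assert cell.shape == (num_classes + 5 * num_boxes,)
    class_probs = cell[:num_classes]
    # Each row is (confidence, x, y, width, height) for one box.
    boxes = cell[num_classes:].reshape(num_boxes, 5)
    return class_probs, boxes

output = np.zeros((S, S, C + 5 * B))  # stand-in for a real network output
class_probs, boxes = split_cell(output[3, 4])
print(output.shape, class_probs.shape, boxes.shape)
```

With these assumed values each cell holds 30 numbers, so the full output is a 7 x 7 x 30 tensor.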

Confidence reflects both the probability that an object is present and the predicted IoU of the bounding box with the object's actual bounding box.
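To make the confidence target concrete, here is a minimal sketch of IoU for two boxes and of the resulting target value. It assumes both boxes are given as (center x, center y, width, height) in the same coordinate frame, so cell-relative centers would need converting first. The function names `iou` and `confidence_target` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (center_x, center_y, width, height) boxes."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # Overlap is zero when the boxes do not intersect along either axis.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def confidence_target(has_object, pred_box, true_box):
    """Target confidence: IoU when an object is present, otherwise zero."""
    return iou(pred_box, true_box) if has_object else 0.0
```

For example, two unit squares offset by half a width overlap on an area of 0.5 out of a union of 1.5, giving an IoU of one third.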

Losses

We train YOLO with gradient descent on a loss function that assigns bounding boxes to grid cells.

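  1. For each actual object, calculate the squared error of the center coordinate, weighted by $\lambda_{\text{coord}}$ (hats denote the network's predictions),

     $$\lambda_{\text{coord}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right]$$

  2. For each actual object, calculate the squared error of the square roots of the dimensions,

     $$\lambda_{\text{coord}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right]$$

     Taking square roots makes the loss penalize small variations in small boxes more than small variations in big boxes.

  3. For each actual object, calculate the error between the actual IoU of the bounding box and the predicted IoU; in other words, penalize mis-predicted confidences,

     $$\sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2$$

     Note that this target moves during training, since the network must predict the IoU of its own predicted bounding box.

  4. Do the same for predictors that aren't responsible for any bounding box, but multiply this error by $\lambda_{\text{noobj}}$ to decrease its importance relative to confidences for cells with objects,

     $$\lambda_{\text{noobj}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2$$

  5. For each actual object, calculate the squared error between the probabilities of each class,

     $$\sum_{i=1}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2$$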
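The five terms can be combined in a short NumPy sketch. This is an illustration under stated assumptions, not a reference implementation: each cell is assumed to be laid out as class probabilities followed by (confidence, x, y, w, h) per box, the caller supplies a hypothetical responsibility mask `resp` and fills the target confidences, and the default weights 5 and 0.5 are the values reported in the original YOLO paper.

```python
import numpy as np

def yolo_loss(pred, target, resp, num_classes, lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-of-squares YOLO-style loss.

    pred, target: (S, S, C + 5B) arrays; each cell holds C class probabilities,
        then B groups of (confidence, x, y, w, h).
    resp: (S, S, B) array, 1.0 where that box predictor is responsible for an object.
    """
    S, B = pred.shape[0], resp.shape[-1]
    p_box = pred[..., num_classes:].reshape(S, S, B, 5)
    t_box = target[..., num_classes:].reshape(S, S, B, 5)
    noobj = 1.0 - resp
    has_obj = resp.max(axis=-1)  # (S, S): 1.0 if the cell contains an object

    # Terms 1 and 2: box center, and square roots of width and height.
    center = np.sum(resp * np.sum((p_box[..., 1:3] - t_box[..., 1:3]) ** 2, axis=-1))
    # Clip before the root because an untrained network may output negative sizes.
    root_diff = np.sqrt(np.clip(p_box[..., 3:5], 0, None)) - np.sqrt(t_box[..., 3:5])
    size = np.sum(resp * np.sum(root_diff ** 2, axis=-1))

    # Terms 3 and 4: confidence error for responsible and non-responsible predictors.
    conf_err = (p_box[..., 0] - t_box[..., 0]) ** 2
    conf_obj = np.sum(resp * conf_err)
    conf_noobj = np.sum(noobj * conf_err)

    # Term 5: class probabilities, only for cells that contain an object.
    class_err = np.sum((pred[..., :num_classes] - target[..., :num_classes]) ** 2, axis=-1)
    cls = np.sum(has_obj * class_err)

    return lambda_coord * (center + size) + conf_obj + lambda_noobj * conf_noobj + cls
```

In a real training loop, the target confidence for a responsible predictor would be recomputed each step as the IoU between its current prediction and the ground truth box, and the whole expression would be written in an autodiff framework so that gradients can flow.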

Prediction

During prediction, we first run the CNN on the input image. Since some objects near the edge of a cell may cause multiple bounding boxes to be predicted, we then run Non-Max Suppression to remove redundant predictions.
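A minimal sketch of greedy non-max suppression follows. It assumes boxes have already been converted to corner format (x1, y1, x2, y2) in image coordinates with positive area, and that each box carries a single score. The function name and the 0.5 threshold are illustrative choices, not values taken from this note.

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,). Returns kept indices."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Intersection of the best box with every remaining box.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter)
        # Drop boxes that overlap the best box too much; keep the others for later rounds.
        order = rest[iou <= iou_threshold]
    return keep
```

For detection with several classes, this is typically run once per class so that overlapping boxes of different classes do not suppress each other.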

Details

The theory and descriptions above capture the broad idea of the model, but there are a few notational and implementation details to be aware of.

First, we need to scale all predictions to be between $0$ and $1$. To do this, we need specific definitions of the variables.

  1. $x$ and $y$ give the location of the box center relative to the grid cell.
  2. $w$ and $h$ are the box width and height relative to the size of the entire image.
  3. The confidence $C_i$ is the IoU between the predicted box and the ground truth box. Only one bounding box predictor is responsible for each object; the one responsible is the predictor with the highest IoU.
  4. $\mathbb{1}_{ij}^{\text{obj}}$ means the $j$-th bounding box predictor in cell $i$ is responsible for the bounding box.
  5. $\mathbb{1}_{i}^{\text{obj}}$ means there is an object in cell $i$.
  6. $\mathbb{1}_{ij}^{\text{noobj}}$ means there's no object for the $j$-th bounding box predictor in cell $i$.
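The first two definitions can be sketched as an encoding step that turns a ground truth box into training targets. The sketch assumes the box arrives as a center and size already normalized by the image dimensions; `encode_box` and `grid_size` are hypothetical names.

```python
def encode_box(cx, cy, w, h, grid_size):
    """Map a box in image-relative units (all in [0, 1]) to its cell and cell-relative center."""
    # Clamp so that a center exactly on the right or bottom edge stays in the last cell.
    col = min(int(cx * grid_size), grid_size - 1)
    row = min(int(cy * grid_size), grid_size - 1)
    # Offset of the center inside its cell, in units of one cell.
    x = cx * grid_size - col
    y = cy * grid_size - row
    # Width and height stay relative to the whole image.
    return row, col, (x, y, w, h)

row, col, target = encode_box(0.5, 0.5, 0.2, 0.4, grid_size=7)
print(row, col, target)  # prints: 3 3 (0.5, 0.5, 0.2, 0.4)
```

A box centered in the middle of the image lands in the middle cell of a 7 x 7 grid, halfway across that cell in both directions.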

Info

If there are more than $B$ objects in a cell, where $B$ is the number of boxes per cell, YOLO isn't able to predict bounding boxes for all of them. This is a weakness in the design of the algorithm, but it's extremely unlikely for this to occur given a small-enough cell size and a large-enough $B$.