YOLO (You Only Look Once) performs multi-object detection with a single pass through a Convolutional Neural Network.
We divide the input image into an $S \times S$ grid, and each grid cell is responsible for predicting bounding boxes for objects whose center falls inside the cell. A single cell is limited to predicting a fixed number of bounding boxes, $B$.
Architecture
The architecture begins with DarkNet to extract features from the image. It then flattens the result and feeds it into fully connected (FC) layers.
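The final output has shape $S \times S \times (5B + C)$, where $S$ is the grid size, $B$ is the number of boxes per cell, and $C$ is the number of classes.
Each cell is in charge of $B$ bounding boxes, each described by five values ($x$, $y$, $w$, $h$, and a confidence), plus $C$ class probabilities.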
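As a rough illustration of the output layout, the sketch below reshapes a flat FC output into per-cell boxes and class scores. It assumes the usual YOLO shape of $S \times S \times (5B + C)$; the values `S=7`, `B=2`, `C=20`, the ordering of values inside each cell, and the helper name `split_output` are all assumptions made for the example, not something this note specifies.

```python
import numpy as np

# Assumed example values: grid size, boxes per cell, number of classes.
S, B, C = 7, 2, 20

def split_output(flat):
    """Reshape a flat network output into boxes and class scores.

    Assumed layout per cell: B groups of (x, y, w, h, confidence),
    followed by C class probabilities.
    """
    grid = flat.reshape(S, S, 5 * B + C)
    boxes = grid[..., : 5 * B].reshape(S, S, B, 5)
    class_probs = grid[..., 5 * B :]
    return boxes, class_probs

flat = np.zeros(S * S * (5 * B + C))
boxes, class_probs = split_output(flat)
```

Here `boxes[i, j, k]` holds the five values of the `k`-th predictor in cell `(i, j)`, and `class_probs[i, j]` holds one class distribution shared by the whole cell.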
Confidence depends on both the probability of there being an object as well as the predicted IoU of the bounding box with the object's actual bounding box.
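To make the confidence definition concrete, here is a minimal sketch of IoU for two boxes in center format `(cx, cy, w, h)`, and of a confidence target that is the IoU when an object is present and zero otherwise. The function names `to_corners`, `iou`, and `confidence_target` are hypothetical helpers introduced for this example.

```python
def to_corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def iou(box_a, box_b):
    """Intersection over union of two center-format boxes."""
    ax1, ay1, ax2, ay2 = to_corners(box_a)
    bx1, by1, bx2, by2 = to_corners(box_b)
    # Overlap along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def confidence_target(has_object, pred_box, true_box):
    """Target confidence: IoU if an object is present, otherwise 0."""
    return iou(pred_box, true_box) if has_object else 0.0
```

Two unit boxes offset by half a width overlap in half their area, so their IoU is 0.5 / 1.5, or one third.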
Losses
We train YOLO with gradient descent on a loss function that assigns bounding boxes to grid cells.
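1. For each actual object, calculate the squared error of the center coordinate, $\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right]$.
2. For each actual object, calculate the squared error of the roots of the dimensions, $\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right]$. Rooting the values makes the loss penalize small variations in small boxes more than small variations in big boxes.
3. For each actual object, calculate the error between the actual IoU of the bounding box and the predicted IoU; in other words, penalize mis-predicted confidences, $\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} (C_i - \hat{C}_i)^2$. Note that this error is a moving target since we want the network to predict the IoU of the predicted bounding box.
4. Do the same as above if the cell isn't responsible for a bounding box, but multiply this error by $\lambda_{\text{noobj}}$: $\lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2$.
5. For each actual object, calculate the squared error between the probabilities of each class, $\sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2$.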
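The five loss terms above can be sketched for a single image in NumPy. This is only an illustration under stated assumptions: the responsibility mask `resp` and the confidence targets are taken as precomputed inputs, the default weights 5.0 and 0.5 are the values used in the original YOLO paper rather than anything this note states, the widths and heights are clipped at zero before the square root to avoid NaNs, and the name `yolo_loss` is hypothetical.

```python
import numpy as np

def yolo_loss(pred_boxes, pred_cls, true_boxes, true_cls, resp,
              lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-of-squares YOLO-style loss for one image.

    pred_boxes, true_boxes: (S, S, B, 5) as (x, y, w, h, confidence).
    pred_cls, true_cls:     (S, S, C) class probabilities per cell.
    resp: (S, S, B) boolean mask, True where predictor j of cell i
          is responsible for an object (assumed precomputed).
    """
    resp = resp.astype(float)
    noobj = 1.0 - resp
    cell_has_obj = resp.max(axis=-1)  # 1 if any predictor in the cell is responsible

    sq = (pred_boxes - true_boxes) ** 2
    # 1. Center coordinates.
    center = (resp * (sq[..., 0] + sq[..., 1])).sum()
    # 2. Square roots of width and height (clipped to stay real-valued).
    root_p = np.sqrt(np.clip(pred_boxes[..., 2:4], 0.0, None))
    root_t = np.sqrt(np.clip(true_boxes[..., 2:4], 0.0, None))
    size = (resp * ((root_p - root_t) ** 2).sum(axis=-1)).sum()
    # 3. Confidence where a predictor is responsible for an object.
    conf_obj = (resp * sq[..., 4]).sum()
    # 4. Confidence where it is not, down-weighted.
    conf_noobj = (noobj * sq[..., 4]).sum()
    # 5. Class probabilities for cells that contain an object.
    cls = (cell_has_obj * ((pred_cls - true_cls) ** 2).sum(axis=-1)).sum()

    return (lambda_coord * (center + size) + conf_obj
            + lambda_noobj * conf_noobj + cls)
```

A real training loop would compute `resp` and the confidence targets from the current predictions at every step, which is what makes the confidence term a moving target.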
Prediction
During prediction, we first run the CNN on the input image. Since some objects near the edge of a cell may cause multiple bounding boxes to be predicted, we then run Non-Max Suppression to remove redundant predictions.
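The suppression step can be sketched as a greedy loop: keep the highest-scoring box, drop every remaining box that overlaps it too much, and repeat. The sketch below assumes corner-format boxes `(x1, y1, x2, y2)` and an IoU threshold of 0.5; both are choices made for the example, and the names `corner_iou` and `non_max_suppression` are hypothetical. In practice the procedure is usually applied separately per class.

```python
import numpy as np

def corner_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        ious = np.array([corner_iou(boxes[best], boxes[k]) for k in rest])
        # Keep only boxes that do not overlap the chosen one too much.
        order = rest[ious <= iou_threshold]
    return keep

kept = non_max_suppression(
    [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]],
    [0.9, 0.8, 0.7],
)
```

In this example the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the distant third box survives.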
Details
The theory and descriptions above capture the broad idea of the model, but there are a few notational and implementation details to be aware of.
First, we need to scale all predictions to be between 0 and 1.
- $x$ and $y$ are the location of the box center relative to the grid cell.
- $w$ and $h$ are the box lengths relative to the size of the entire image.
- $C$ (the confidence, distinct from the class count) is the IoU between the predicted box and the ground truth box.
- Only one bounding box predictor is responsible for each object; the one responsible is assigned based on highest IoU.
- $\mathbb{1}_{ij}^{\text{obj}}$ means the $j$-th bounding box predictor in cell $i$ is responsible for the bounding box.
- $\mathbb{1}_{i}^{\text{obj}}$ means there is an object in cell $i$.
- $\mathbb{1}_{ij}^{\text{noobj}}$ means there's no object for the $j$-th bounding box predictor in cell $i$.
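As a small worked example of the scaling described above, the sketch below converts a pixel-space box into grid-relative form: the center becomes an offset inside its grid cell, and the width and height become fractions of the image size. The grid size `S=7`, the 448-pixel image in the usage lines, and the helper name `encode_box` are assumptions made for illustration.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Convert a pixel box (center x, center y, width, height) into
    (row, col, x, y, w, h) with every value after row/col in [0, 1]."""
    gx = cx / img_w * S  # center in grid units along x
    gy = cy / img_h * S  # center in grid units along y
    # The responsible cell is the one containing the center; clamp the
    # index so a center exactly on the right or bottom edge stays in range.
    col = min(int(gx), S - 1)
    row = min(int(gy), S - 1)
    return row, col, gx - col, gy - row, w / img_w, h / img_h

row, col, x, y, w, h = encode_box(224, 112, 100, 50, 448, 448, S=7)
```

For this box the center lands in row 1, column 3 of the grid, halfway across the cell horizontally and three quarters of the way down it.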
Info
If there are more than $B$ objects in a cell, YOLO isn't able to predict bounding boxes for all of them. This is a weakness in the design of the algorithm, but it's extremely unlikely for this to occur given a small-enough cell size and a large-enough $B$.