Faster R-CNN is a two-stage object detection network that builds on ideas from R-CNN and Fast R-CNN. Following the traditional object detection pipeline, we have a region proposal stage and a classification stage; the key innovation in Faster R-CNN is that both stages, trained as Convolutional Neural Networks, operate on a shared feature map rather than the original image.

We first use a feature network (such as VGG-16) to convert the image into a feature map. We'll use this feature map for both region proposal and classification.

Region Proposal Network

The Region Proposal Network (RPN) predicts regions of interest (RoIs) from the feature map. To do so, we slide a window across the feature map and feed each window into a small CNN; this CNN predicts region proposals, each of which is associated with an anchor box and defined by coordinates and objectness scores.

  1. Coordinates represent the location (center coordinate) and scale of the bounding box.
  2. Objectness scores estimate the probability that the box contains an object versus no object.

Anchor Boxes

Anchor boxes are boxes of varying scales and aspect ratios that allow the RPN to specialize each of its coordinate and objectness predictions to different objects. In Faster R-CNN, we use 3 scales and 3 aspect ratios; during training, ground truth boxes are (typically) assigned to the anchor boxes that most resemble them, thus encouraging specialization.
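To make the "3 scales and 3 aspect ratios" concrete, here is a minimal sketch that generates the 9 anchors for one sliding-window position. The function name `make_anchors`, the default scales (128, 256, 512 pixels), the ratios (0.5, 1, 2), and the convention that a ratio means height divided by width are assumptions for illustration; the text above does not fix them.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (cx, cy, w, h) tuples centred at (cx, cy).

    Assumed convention: each anchor has area scale**2 and
    aspect ratio h / w equal to `ratio`.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)  # width shrinks as the box gets taller
            h = s * math.sqrt(r)  # height grows with the ratio
            anchors.append((cx, cy, w, h))
    return anchors

anchors = make_anchors(0.0, 0.0)
print(len(anchors))  # one anchor per (scale, ratio) pair
```

Every sliding-window position reuses the same set of shapes, so the RPN only has to predict offsets and scores relative to each anchor.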

Training

Specifically, we label an anchor as positive (containing an object) if it has the highest IoU with a ground truth box or an IoU $> 0.7$ with any ground truth box; we label it negative if its IoU is $< 0.3$ for all ground truth boxes, and we ignore it otherwise.
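The assignment rule can be sketched in a few lines. This is an illustrative version, not a reference implementation: the helper names `iou` and `label_anchors`, the corner format `(x1, y1, x2, y2)`, and the thresholds 0.7 and 0.3 (the values used in the Faster R-CNN paper) are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_boxes, hi=0.7, lo=0.3):
    """Return one label per anchor: 1 positive, 0 negative, -1 ignored."""
    ious = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in ious:
        best = max(row, default=0.0)
        if best > hi:
            labels.append(1)    # overlaps some ground truth box strongly
        elif best < lo:
            labels.append(0)    # overlaps no ground truth box enough
        else:
            labels.append(-1)   # ambiguous, excluded from the loss
    # The anchor(s) with the highest IoU for each ground truth box
    # are positive even if they fall below the `hi` threshold.
    for j in range(len(gt_boxes)):
        col = [row[j] for row in ious]
        best = max(col, default=0.0)
        if best > 0:
            for i, v in enumerate(col):
                if v == best:
                    labels[i] = 1
    return labels
```

The second pass guarantees that every ground truth box gets at least one positive anchor, which matters for small or oddly shaped objects that no anchor overlaps by more than 0.7.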

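To train the RPN, we optimize objectness loss and coordinate loss,

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$

where $p_i$ is the predicted objectness probability of anchor $i$, $p_i^* = 1$ if the anchor is positive and $p_i^* = 0$ if the anchor is negative, $N_{cls}$ and $N_{reg}$ are normalization terms, and $\lambda$ balances the two losses. $t_i$ is a vector containing the regressed coordinates of the predicted bounding box,

$$t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a}$$

where $x_a, y_a, w_a, h_a$ are the parameters of the anchor box and $x, y, w, h$ are the predicted parameters. $t_i^*$ is defined similarly except we replace $x, y, w, h$ with $x^*, y^*, w^*, h^*$ denoting the ground truth box coordinates.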
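As a rough illustration of the regressed coordinates, the sketch below encodes a box relative to an anchor and decodes it back. It assumes the parameterization from the Faster R-CNN paper (center offsets normalized by the anchor size, log-scaled width and height) and boxes stored as `(x, y, w, h)` with `(x, y)` the center; `encode` and `decode` are hypothetical helper names.

```python
import math

def encode(box, anchor):
    """Offsets (t_x, t_y, t_w, t_h) of `box` relative to `anchor`."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode(t, anchor):
    """Invert `encode`: recover (x, y, w, h) from offsets and the anchor."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))

anchor = (50.0, 50.0, 100.0, 200.0)
box = (60.0, 30.0, 50.0, 400.0)
t = encode(box, anchor)   # what the RPN regresses towards for this anchor
print(decode(t, anchor))  # approximately the original box
```

Encoding the ground truth box the same way gives the regression target, so a box that coincides with its anchor has all-zero offsets.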

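$L_{cls}$ is the log loss, and $L_{reg}$ uses the smooth $L_1$ loss

$$\text{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$

where in our case $x = t_i - t_i^*$, applied to each coordinate and summed.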
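A small sketch of the coordinate loss may help here. It assumes the standard smooth L1 definition (quadratic for errors below 1, linear beyond) applied to each of the four offsets and summed; `smooth_l1` and `reg_loss` are illustrative names.

```python
def smooth_l1(x):
    """Quadratic near zero, linear for large errors."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

def reg_loss(t, t_star):
    """Sum of smooth L1 over the four box offsets."""
    return sum(smooth_l1(a - b) for a, b in zip(t, t_star))

# Small error on t_x (quadratic branch), large error on t_y (linear branch).
print(reg_loss((0.0, 0.0, 0.0, 0.0), (0.5, 2.0, 0.0, 0.0)))  # 0.125 + 1.5 = 1.625
```

The linear branch keeps the gradient bounded for badly placed boxes, which makes the loss less sensitive to outliers than a plain squared error.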

Detection Network (Fast R-CNN)

The detection network classifies proposed RoIs using Fast R-CNN. It takes in the first feature map and the RoIs, and for each RoI's projection onto the feature map, we apply RoI pooling to create another fixed-size feature map; RoI pooling essentially performs max pooling but with a fixed output resolution. That is, given a feature projection of size $h \times w$, to create the fixed-size $H \times W$ output, we divide the input into sub-windows of size approximately $h/H \times w/W$ and take the max value in each sub-window.
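The pooling step can be sketched on a single-channel feature projection stored as a nested list. The function name `roi_max_pool` and the way window boundaries are rounded (floor for the start, ceiling for the end, so windows may overlap by one row or column when the sizes do not divide evenly) are assumptions; real implementations differ in how they quantize.

```python
def roi_max_pool(feature, out_h, out_w):
    """Max-pool an h x w grid (list of lists) down to out_h x out_w."""
    h, w = len(feature), len(feature[0])
    pooled = []
    for i in range(out_h):
        # Row range of the i-th sub-window: floor(start), ceil(end).
        r0 = (i * h) // out_h
        r1 = ((i + 1) * h + out_h - 1) // out_h
        row = []
        for j in range(out_w):
            c0 = (j * w) // out_w
            c1 = ((j + 1) * w + out_w - 1) // out_w
            row.append(max(feature[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled

feature = [[r * 4 + c for c in range(4)] for r in range(4)]
print(roi_max_pool(feature, 2, 2))  # [[5, 7], [13, 15]]
```

Because the output size is fixed regardless of the RoI's size, the fully-connected layers that follow always receive an input of the same shape.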

This second feature map goes through some fully-connected layers and gives us the softmax class probabilities (including a probability for background) as well as bounding box regressors. These regressors are refined bounding box coordinates, specifically defined as regression offsets similar to the anchor box offsets as defined above.

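Our loss consists of classification and regression terms,

$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \geq 1] L_{loc}(t^u, v)$$

where $p$ is our predicted probabilities, $u$ is the true class, $t^u$ is our predicted regressors for class $u$, and $v$ is the true regressors. The indicator $[u \geq 1]$ is 1 for object classes and 0 for background ($u = 0$), so we only penalize regressors (the second term) that are assigned to a ground truth object. Like above, our two components are specifically the log loss and smooth $L_1$ loss respectively.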
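For a single RoI, the detection loss could be computed roughly as follows. This sketch assumes class index 0 is background, that `p` is already a softmax output, and that the weighting factor `lam` defaults to 1; `detection_loss` and `smooth_l1` are illustrative names rather than any library's API.

```python
import math

def smooth_l1(x):
    """Quadratic near zero, linear for large errors."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

def detection_loss(p, u, t_u, v, lam=1.0):
    """Log loss on the true class plus smooth L1 on the box offsets.

    p   : list of class probabilities (index 0 assumed to be background)
    u   : true class index
    t_u : predicted offsets for class u
    v   : target offsets
    """
    cls = -math.log(p[u])
    # Background RoIs have no ground truth box, so the box term is skipped.
    loc = sum(smooth_l1(a - b) for a, b in zip(t_u, v)) if u >= 1 else 0.0
    return cls + lam * loc

print(detection_loss([0.5, 0.5], 1, (0.5, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)))
```

For a background RoI the same call returns only the classification term, no matter how far off the predicted offsets are.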