Mask R-CNN is an extension of Faster R-CNN for instance segmentation. We use the same region proposal network but attach a separate branch to the detection network for segmentation; thus, our detection network is now responsible for classification, bounding box regression, and segmentation.
The mask branch predicts a binary probability mask for each object class, and we use the classification branch to choose which mask to use. This design crucially decouples classification from segmentation (unlike semantic segmentation with a Fully Convolutional Network), which allows each binary mask to specialize to its respective class.
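As a rough illustration of how the classification branch picks among the per-class masks, here is a minimal NumPy sketch for a single RoI. The function name `select_instance_mask`, the array shapes, the per-pixel sigmoid, and the 0.5 threshold are assumptions made for this example, not details taken from the text above.

```python
import numpy as np

def select_instance_mask(mask_logits, class_scores, threshold=0.5):
    """Pick one binary mask for a single RoI (hypothetical helper).

    mask_logits:  (K, m, m) array, one mask of logits per object class.
    class_scores: (K,) array from the classification branch.
    """
    # The classification branch decides which class this RoI belongs to.
    k = int(np.argmax(class_scores))
    # Only that class's mask is used; each pixel is an independent
    # binary decision (assumed here to be a sigmoid over the logit).
    probs = 1.0 / (1.0 + np.exp(-mask_logits[k]))
    return k, probs >= threshold

# Toy inputs: 3 classes, 2x2 masks.
mask_logits = np.full((3, 2, 2), -5.0)
mask_logits[1] = np.array([[4.0, -4.0], [-4.0, 4.0]])
class_scores = np.array([0.1, 0.7, 0.2])

class_id, mask = select_instance_mask(mask_logits, class_scores)
```

Because the masks for the other classes are simply ignored, no class competes with another at the pixel level; the mask for class `k` only has to answer "object or background" for that class.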
To train this branch, we include
RoI Align
In addition to the new branch, we need to modify the RoI pooling layer inherited from Fast R-CNN, since its coarse quantization of RoI boundaries and bins to the feature-map grid misaligns the pooled features with the input and disrupts segmentation accuracy, which relies on pixel-level precision. We use the same general idea of RoI pooling, dividing each RoI into bins, but instead of quantizing and taking the max, we sample four locations in each bin at real-valued coordinates and compute each value via bilinear interpolation from nearby features. The results of these four samples are aggregated using either max or average.
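The sampling procedure can be sketched in a few lines of NumPy. This is a simplified single-channel illustration, not a reference implementation: the names `bilinear` and `roi_align`, the 2x2 sample grid at offsets 0.25 and 0.75 within each bin, and the convention that feature values sit at integer coordinates are all assumptions made for this example.

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly interpolate a 2-D feature map at real-valued (y, x)."""
    h, w = feat.shape
    # Clamp to the valid range so border samples stay well defined.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feat[y0, x0] + dx * feat[y0, x1]
    bottom = (1 - dx) * feat[y1, x0] + dx * feat[y1, x1]
    return (1 - dy) * top + dy * bottom

def roi_align(feat, roi, out_size=2, agg=np.mean):
    """roi = (y1, x1, y2, x2) in feature-map coordinates, never rounded."""
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # Four sample points per bin, on a regular 2x2 grid.
            samples = [
                bilinear(feat, y1 + (i + fy) * bin_h, x1 + (j + fx) * bin_w)
                for fy in (0.25, 0.75)
                for fx in (0.25, 0.75)
            ]
            out[i, j] = agg(samples)  # average or max over the samples
    return out

# Toy feature map with feat[y, x] = 10*y + x, and a RoI with
# fractional boundaries that RoI pooling would have had to round.
feat = 10.0 * np.arange(4)[:, None] + np.arange(4)[None, :]
pooled_avg = roi_align(feat, (0.5, 0.5, 2.5, 2.5), out_size=2, agg=np.mean)
pooled_max = roi_align(feat, (0.5, 0.5, 2.5, 2.5), out_size=2, agg=np.max)
```

Note that the RoI coordinates and the bin boundaries are used as real numbers throughout; nothing is snapped to the grid, which is the point of the change from RoI pooling.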