Mask R-CNN, or Mask Region-based Convolutional Neural Network, is an extension of the Faster R-CNN object detection architecture, used in computer vision for both object detection and instance segmentation.
The following are the key differences between Mask R-CNN and Faster R-CNN:
- ROIPool was replaced by ROIAlign to address misalignments between the input feature map and the pooling grid for the region of interest (ROI);
- The use of a Feature Pyramid Network (FPN), which extends Mask R-CNN's capabilities by providing a multi-scale feature representation, enabling efficient feature reuse, and handling variation in object scale.
What distinguishes Mask R-CNN is its ability to segment and identify the pixel-wise boundaries of each object in an image. This fine-grained segmentation capability is achieved by adding an extra "mask head" branch to the Faster R-CNN model.
What is Mask R-CNN?
Mask R-CNN is a deep learning model for object detection and instance segmentation. It builds on the Faster R-CNN architecture.
Mask R-CNN's primary novelty is its ability to perform pixel-wise instance segmentation alongside object detection. This is accomplished by adding an extra "mask head" branch, which produces a precise segmentation mask for each recognised object, delineating fine-grained pixel-level boundaries for accurate and thorough instance segmentation.
ROIAlign and the Feature Pyramid Network (FPN) are two significant innovations built into Mask R-CNN. ROIAlign overcomes the limitations of classic ROI pooling by using bilinear interpolation in the pooling step. This reduces misalignment and preserves accurate spatial information from the input feature map, improving segmentation accuracy, especially for small objects.
By building a multi-scale feature pyramid, the FPN plays a critical role in feature extraction. The pyramid integrates information from many scales, giving the model a more thorough grasp of object context and improving detection and segmentation across a wide range of object sizes.
The Mask R-CNN framework for instance segmentation (Source: https://zhuanlan.zhihu.com/p/62492064)
Mask R-CNN Architecture
Mask R-CNN's design is based on the Faster R-CNN architecture, extended with a "mask head" branch for pixel-wise segmentation. The overall architecture is made up of several major components:
Mask R-CNN's backbone network is typically a pre-trained convolutional neural network, such as ResNet or ResNeXt. The backbone processes the input image and extracts high-level features. An FPN is then built on top of this backbone to form a feature pyramid.
FPNs address the problem of handling objects of varied sizes and scales in an image. By merging features from several levels of the backbone network, the FPN forms a multi-scale feature pyramid. This pyramid contains features at different spatial resolutions, ranging from low-resolution features rich in semantic information to high-resolution features with finer spatial detail.
Feature Pyramid Networks (FPN) backbone.
In Mask R-CNN, the FPN comprises the following steps:
- High-Level Feature Extraction: The backbone network extracts high-level features from the input image.
- Feature Fusion: FPN connects multiple levels of the backbone network through a top-down pathway. This pathway merges high-level semantic information into lower-level feature maps, allowing the model to reuse features at multiple scales.
- Feature Pyramid: The fusion process forms a multi-scale feature pyramid, with each level of the pyramid representing a different feature resolution. The highest-resolution features sit at the bottom of the pyramid, while the lowest-resolution, most semantically rich features sit at the top.
Thanks to the feature pyramid formed by the FPN, Mask R-CNN can handle objects of widely varying sizes. The multi-scale representation lets the model capture contextual information and reliably recognise objects at multiple scales within the image.
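The top-down fusion described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's exact configuration: the channel counts are arbitrary, the 1x1 lateral convolutions are simulated with random channel-mixing matrices, and nearest-neighbour upsampling stands in for the interpolation a real implementation would use.

```python
import numpy as np

def topdown_merge(c_feats, out_channels=8, rng=np.random.default_rng(0)):
    """Build a tiny FPN-style pyramid from backbone maps `c_feats`,
    ordered deepest (lowest resolution) first; each map has shape
    (channels, H, W)."""
    def lateral(x):
        # Simulate a 1x1 lateral conv: mix channels down to a common width.
        w = rng.standard_normal((out_channels, x.shape[0]))
        return np.einsum('oc,chw->ohw', w, x)

    def upsample2x(x):
        # Nearest-neighbour 2x upsampling of the coarser level.
        return x.repeat(2, axis=1).repeat(2, axis=2)

    pyramid = [lateral(c_feats[0])]           # deepest level: lateral only
    for c in c_feats[1:]:                     # walk toward finer levels
        merged = lateral(c) + upsample2x(pyramid[-1])
        pyramid.append(merged)
    return pyramid                            # deepest-to-finest order

c5 = np.ones((32, 4, 4))    # coarse, semantically strong backbone map
c4 = np.ones((16, 8, 8))
c3 = np.ones((8, 16, 16))   # fine, spatially detailed backbone map
p5, p4, p3 = topdown_merge([c5, c4, c3])
print(p5.shape, p4.shape, p3.shape)  # (8, 4, 4) (8, 8, 8) (8, 16, 16)
```

Note how every pyramid level ends up with the same channel width at its native resolution, which is what lets later stages treat all scales uniformly.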
Region Proposal Network (RPN)
The RPN is responsible for producing region proposals: candidate bounding boxes that may contain objects in the image. It operates on the backbone network's feature maps and suggests prospective regions of interest.
Region Proposal Network (RPN) example.
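The RPN scores and refines a dense set of reference boxes ("anchors") tiled over the feature map. The sketch below generates such anchors in input-image coordinates; the scales, aspect ratios, and stride are illustrative values, not the ones used in the paper.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride, scales=(64, 128), ratios=(0.5, 1.0, 2.0)):
    """Generate (cx, cy, w, h) anchor boxes centred on every cell of a
    feat_h x feat_w feature map, mapped back to image coordinates via
    `stride`. Each anchor keeps area scale**2 while its aspect ratio
    (h / w) varies over `ratios`."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Centre of this feature-map cell in image coordinates.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w, h = s / np.sqrt(r), s * np.sqrt(r)
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)

a = make_anchors(2, 3, stride=16)
print(a.shape)  # (36, 4): 2*3 locations x 2 scales x 3 ratios
```

In the full network, a small conv head then predicts an objectness score and four box-regression deltas for every one of these anchors.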
The ROIAlign (Region of Interest Align) layer is applied after the RPN generates region proposals. This step overcomes the misalignment problem of ROI pooling.
ROIAlign extracts features from the input feature map precisely for each region proposal, enabling accurate pixel-wise segmentation in instance segmentation tasks.
The primary goal of ROIAlign is to align the features inside a region of interest (ROI) with the spatial grid of the output feature map. This alignment avoids the information loss that occurs when the ROI's spatial coordinates are quantized to the nearest integer (as in ROI pooling).
The ROIAlign process involves the following steps:
- Input Feature Map: The process starts from the input feature map, typically taken from the backbone network. This feature map carries rich semantic information about the whole image.
- Region Proposals: The RPN generates region proposals (candidate bounding boxes) that may contain objects of interest within the image.
- Division into Grids: Each region proposal is divided into a fixed number of equal-sized spatial bins or grids. These grids define where features for the region of interest are sampled from the input feature map.
- Bilinear Interpolation: Unlike ROI pooling, which quantizes the grids' spatial coordinates to the nearest integer, ROIAlign computes each grid's pooled value via bilinear interpolation. This interpolation keeps the features inside the ROI precisely aligned.
- Output Features: The interpolated features aligned with each grid serve as the representative features for the region proposal. These aligned features capture fine-grained spatial information, which is critical for accurate segmentation.
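The bilinear interpolation step can be made concrete with a small helper that samples a feature map at fractional coordinates, weighting the four nearest cells. This is a minimal sketch of the sampling primitive, not a full ROIAlign implementation.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Sample 2D feature map `feat` at fractional coordinates (x, y)
    by bilinear interpolation of the four surrounding cells."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, feat.shape[1] - 1)
    y1 = min(y0 + 1, feat.shape[0] - 1)
    dx, dy = x - x0, y - y0
    # Interpolate along x on the top and bottom rows, then along y.
    top = (1 - dx) * feat[y0, x0] + dx * feat[y0, x1]
    bot = (1 - dx) * feat[y1, x0] + dx * feat[y1, x1]
    return (1 - dy) * top + dy * bot

fm = np.array([[0.0, 1.0],
               [2.0, 3.0]])
print(bilinear_sample(fm, 0.5, 0.5))  # 1.5: the average of the four cells
```

Because the sampling point never gets snapped to a cell centre, sub-pixel position information survives into the pooled features.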
By applying bilinear interpolation during the pooling phase, ROIAlign considerably increases the accuracy of feature extraction for each region proposal and mitigates misalignment.
This precise alignment lets Mask R-CNN build more accurate segmentation masks, which is especially valuable for small objects or regions where fine detail must be preserved. As a result, ROIAlign contributes substantially to Mask R-CNN's strong performance on instance segmentation tasks.
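To see why quantization matters, compare the two strategies on a toy one-dimensional slice of a feature map: ROI pooling snaps a proposal edge to the nearest cell, while ROIAlign interpolates at the true fractional position. The numbers are illustrative only.

```python
import numpy as np

fm = np.arange(36, dtype=float).reshape(6, 6)  # toy feature map

# Suppose a proposal's left edge falls at x = 2.6 on the feature map.
x = 2.6

# ROI pooling: quantize the coordinate to the nearest integer cell,
# discarding 0.4 of a cell of spatial precision.
pooled = fm[0, int(round(x))]                  # reads column 3

# ROIAlign: keep the fractional coordinate and interpolate linearly
# between the two neighbouring cells.
x0 = int(np.floor(x))
frac = x - x0
aligned = (1 - frac) * fm[0, x0] + frac * fm[0, x0 + 1]

print(pooled, aligned)  # 3.0 vs 2.6
```

The 0.4-cell discrepancy looks small, but on a feature map with stride 16 it corresponds to roughly six pixels in the input image, enough to blur the boundary of a small object's mask.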
The Mask Head is the new branch in Mask R-CNN responsible for producing segmentation masks for each region proposal. Using the aligned features from ROIAlign, it predicts a binary mask for each object, outlining the pixel-wise boundaries of the instance. The Mask Head typically consists of several convolutional layers followed by upsampling layers (deconvolution, i.e. transposed convolution).
Mask Head Structure (Source: https://arxiv.org/pdf/1703.06870.pdf)
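The spatial flow through such a head can be checked with the standard output-size formulas for convolution and transposed convolution. The sketch below follows the commonly cited configuration (14x14 ROI features, four 3x3 convs, one 2x deconv, 28x28 mask logits); treat the exact layer counts as illustrative.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a standard convolution."""
    return (size + 2 * pad - kernel) // stride + 1

def deconv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a transposed (de)convolution."""
    return (size - 1) * stride - 2 * pad + kernel

s = 14                                 # ROIAlign output: 14x14 features
for _ in range(4):                     # four 3x3 convs with 'same' padding
    s = conv_out(s, kernel=3, pad=1)
s = deconv_out(s, kernel=2, stride=2)  # one 2x-upsampling deconv
print(s)  # 28: the side length of the predicted binary mask
```

A final 1x1 convolution (not shown) would then map the 28x28 feature map to one mask per class.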
During training, the model is jointly optimised with a combination of classification loss, bounding-box regression loss, and mask segmentation loss. This lets the model learn to recognise objects while refining their bounding boxes and producing precise segmentation masks.
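The joint objective can be sketched as the sum of its three terms. This toy version uses standard loss forms (cross-entropy, smooth L1, per-pixel binary cross-entropy) for a single ROI and omits the loss weights and per-class mask selection a real implementation would include.

```python
import numpy as np

def multitask_loss(cls_logits, cls_target, box_pred, box_target,
                   mask_logits, mask_target):
    """Toy per-ROI training loss L = L_cls + L_box + L_mask."""
    # Classification: softmax cross-entropy over the class logits.
    p = np.exp(cls_logits - cls_logits.max())
    p /= p.sum()
    l_cls = -np.log(p[cls_target])

    # Box regression: smooth L1 on the four box deltas.
    d = np.abs(box_pred - box_target)
    l_box = np.where(d < 1, 0.5 * d ** 2, d - 0.5).sum()

    # Mask: mean per-pixel binary cross-entropy on sigmoid outputs,
    # so mask classes do not compete (unlike a per-pixel softmax).
    m = 1.0 / (1.0 + np.exp(-mask_logits))
    l_mask = -(mask_target * np.log(m)
               + (1 - mask_target) * np.log(1 - m)).mean()

    return l_cls + l_box + l_mask

loss = multitask_loss(
    cls_logits=np.array([2.0, 0.5, -1.0]), cls_target=0,
    box_pred=np.array([0.1, 0.0, 0.2, 0.0]),
    box_target=np.zeros(4),
    mask_logits=np.zeros((4, 4)), mask_target=np.ones((4, 4)),
)
print(round(loss, 3))  # 0.959
```

Using an independent sigmoid per pixel for the mask term is the detail that decouples mask prediction from classification: the class branch decides *what* the object is, and the mask branch only decides *where* it is.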
Mask R-CNN Performance
The table below shows Mask R-CNN's instance segmentation performance, along with some visual results, on the COCO test dataset.
Mask R-CNN performance on instance segmentation (Source: https://arxiv.org/pdf/1703.06870.pdf)
The MNC and FCIS models won the COCO 2015 and 2016 segmentation challenges, respectively. Notably, Mask R-CNN outperforms the more complex FCIS+++, which includes multi-scale training/testing, horizontal-flip testing, and OHEM. All of the entries are single-model results.
Mask R-CNN on COCO test images, using ResNet-101-FPN and running at 5 fps, with 35.7 mask AP. (Source: https://arxiv.org/pdf/1703.06870.pdf)
Mask R-CNN Limitations
Mask R-CNN excels in a variety of domains, making it an effective model for a wide range of computer vision applications such as object detection, instance segmentation, multi-object segmentation, and complex scene understanding.
However, there are several drawbacks to Mask R-CNN to consider:
- Small Object Segmentation: Due to the limited pixel information available, Mask R-CNN may fail to accurately segment very small objects.
- Computational Complexity: Training and inference can be computationally demanding, requiring significant resources, particularly for high-resolution images or large datasets.
- Data Requirements: Training Mask R-CNN needs a large quantity of annotated data, which can be time-consuming and costly to obtain.
- Limited Generalisation: The model's capacity to generalise to previously unseen object categories is limited, particularly when data is sparse.
Mask R-CNN combines object detection and instance segmentation, recognising objects while precisely delineating their boundaries at the pixel level. By combining a Feature Pyramid Network (FPN) with Region of Interest Align (ROIAlign), Mask R-CNN delivers high performance and accuracy.
Mask R-CNN also has certain disadvantages, such as computational cost and memory consumption during training and inference. It may struggle to segment very small objects or handle heavily occluded scenes. Acquiring a large amount of annotated training data can also be difficult, and fine-tuning the model for specific domains may require careful parameter adjustment.
Learn more about SKY ENGINE AI offering
To get more information on synthetic data, tools, methods, and technology, check out the following resources:
- Telecommunication equipment AI-driven detection using drones
- A talk on Ray tracing for Deep Learning to generate synthetic data at GTC 2020 by Jakub Pietrzak, CTO SKY ENGINE AI (Free registration required)
- Example synthetic data videos generated in SKY ENGINE AI platform to accelerate AI models training
- Presentation on using synthetic data for building 5G network performance optimization platform
- Working example with code for synthetic data generation and AI models training for team-based sport analytics in SKY ENGINE AI platform