April 2020
tl;dr: Aligned channel2spatial improves instance segmentation performance over direct channel2spatial.
The paper proposes a relatively rigorous formulation of a 4D tensor representation that unifies DeepMask and InstanceFCN into one framework. The paper seems overly complicated for conveying two simple ideas: we need to align channel2spatial, and we need large masks for large objects.
The key question for dense instance segmentation: why can't we naively adopt the CenterNet architecture for instance segmentation?
The answer is that training a neural network with
- Each mask is an HxW tensor. Dense prediction would require a 4D tensor representation, HxWxHxW. The first two dimensions index the physical location; the latter two are the mask dimensions.
- Two main ideas:
- First, channel2spatial can use the Natural representation (direct channel2spatial) or the Aligned representation (aligned channel2spatial). The authors demonstrate that the aligned representation is a key ingredient for better dense mask prediction performance.
- Second, being able to predict large masks for large objects (tensor bipyramid) boosts instance segmentation performance. Using a constant-resolution mask, as in MaskRCNN, is one bottleneck for segmenting large objects.
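
The difference between the natural and aligned representations can be made concrete with a small NumPy sketch. This is my reading of the idea, not the authors' implementation. Assumptions: the tensor is laid out as (V, U, H, W) with the mask window dimensions first, the stride ratio is 1, window sizes are odd, and out-of-image values are zero-padded. The names `nat2align` and `align2nat` are used here for illustration. In the natural form, all pixels of the mask window centered at (y, x) are stored at (y, x). In the aligned form, each mask pixel is stored at the image location it actually describes.

```python
import numpy as np


def nat2align(nat):
    """Natural -> aligned: aligned[v, u, y, x] = nat[v, u, y - dv, x - du].

    nat has shape (V, U, H, W); (dv, du) is the signed offset of window
    cell (v, u) from the window center. Unit stride and zero padding are
    assumptions of this sketch.
    """
    V, U, H, W = nat.shape
    pv, pu = V // 2, U // 2
    # Zero-pad the spatial axes so shifted reads never go out of bounds.
    padded = np.pad(nat, ((0, 0), (0, 0), (pv, pv), (pu, pu)))
    aligned = np.empty_like(nat)
    for vi in range(V):
        for ui in range(U):
            dv, du = vi - pv, ui - pu
            aligned[vi, ui] = padded[vi, ui,
                                     pv - dv:pv - dv + H,
                                     pu - du:pu - du + W]
    return aligned


def align2nat(aligned):
    """Aligned -> natural: nat[v, u, y, x] = aligned[v, u, y + dv, x + du]."""
    V, U, H, W = aligned.shape
    pv, pu = V // 2, U // 2
    padded = np.pad(aligned, ((0, 0), (0, 0), (pv, pv), (pu, pu)))
    nat = np.empty_like(aligned)
    for vi in range(V):
        for ui in range(U):
            dv, du = vi - pv, ui - pu
            nat[vi, ui] = padded[vi, ui,
                                 pv + dv:pv + dv + H,
                                 pu + du:pu + du + W]
    return nat


# Toy example: one 3x3 mask window predicted at image location (2, 3).
window = np.arange(1, 10, dtype=np.float32).reshape(3, 3)
nat = np.zeros((3, 3, 6, 6), dtype=np.float32)
nat[:, :, 2, 3] = window          # natural: whole window lives at (2, 3)
aligned = nat2align(nat)          # aligned: each cell moves to its own pixel
```

In the toy example, the window cell with offset (dv, du) ends up at spatial location (2 + dv, 3 + du) in `aligned`, so every channel is pixel-aligned with the image. Values shifted past the border are dropped, so the round trip is only exact for windows that lie fully inside the image.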
- Summary of technical details
- Questions and notes on how to improve/revise the current work