January 2021

tl;dr: predict BEV semantic maps from a single monocular video.

Overall impression

Previous SOTA works PyrOccNet and Lift, Splat, Shoot study how to combine synchronized images from multiple cameras into a coherent 360-deg BEV map. BEV-feat-stitching instead tries to stitch frames of a monocular video into a coherent BEV map. This process also requires knowledge of the camera pose sequence.

The use of the intermediate feature map resembles that of feature-metric monodepth and the feature-metric distance in 3DSSD.

To be honest, the results do not look as clean as PyrOccNet's. Future work may combine the strengths of both BEV-feat-stitching and PyrOccNet.

This paper has a follow-up work, STSU, for structured BEV perception.

Key ideas

  • Takes monocular video as input
  • BEV temporal aggregation module
    • Project the features to BEV space
    • BEV aggregation (BEV feature stitching) with camera pose.
      • Aggregation is done in a unified BEV grid (extended BEV); see the sketch after this list.
  • Intermediate feature supervision in camera space with reprojected BEV GT
    • Single-frame supervision for objects (dynamic classes)
    • Supervision aggregated over multiple frames for static classes
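
The aggregation step can be pictured with a short sketch. The snippet below is a minimal illustration, not the authors' code: it warps per-frame BEV features into a shared, extended BEV grid using 2D ego poses and averages them. The shapes, pose format, grid size, and helper names (`pose_to_matrix`, `stitch_bev_features`) are assumptions made for illustration.

```python
# Minimal sketch of BEV feature stitching with ego poses.
# Assumptions (not from the paper): per-frame BEV features of shape (C, H, W)
# have already been produced by the image-to-BEV projection, ego poses are
# given as 2D rigid transforms (x, y, yaw) in a shared world frame, and the
# extended grid is simply a larger BEV map at the same resolution.
import math
import torch
import torch.nn.functional as F

def pose_to_matrix(x, y, yaw):
    """3x3 homogeneous 2D rigid transform taking ego coordinates to world."""
    c, s = math.cos(yaw), math.sin(yaw)
    return torch.tensor([[c, -s, x],
                         [s,  c, y],
                         [0., 0., 1.]])

def stitch_bev_features(bev_feats, poses, out_size=400, resolution=0.25):
    """Warp per-frame BEV features into one extended BEV grid and average.

    bev_feats: list of (C, H, W) tensors, one per video frame
    poses:     list of (x, y, yaw) ego poses in the shared world frame
    """
    C = bev_feats[0].shape[0]
    acc = torch.zeros(C, out_size, out_size)
    weight = torch.zeros(1, out_size, out_size)

    # Metric (x, y, 1) coordinates of every cell of the extended grid,
    # centered on the world origin.
    half = out_size * resolution / 2
    ys, xs = torch.meshgrid(torch.linspace(half, -half, out_size),
                            torch.linspace(-half, half, out_size),
                            indexing="ij")
    world_pts = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)  # (H, W, 3)

    for feat, (x, y, yaw) in zip(bev_feats, poses):
        # Map extended-grid cells into this frame's ego-centric BEV coordinates.
        T_ego_from_world = torch.linalg.inv(pose_to_matrix(x, y, yaw))
        ego_pts = world_pts @ T_ego_from_world.T                     # (H, W, 3)

        # Normalize metric ego coordinates to [-1, 1] for grid_sample.
        h, w = feat.shape[1:]
        u = ego_pts[..., 0] / (w * resolution / 2)
        v = -ego_pts[..., 1] / (h * resolution / 2)
        grid = torch.stack([u, v], dim=-1).unsqueeze(0)              # (1, H, W, 2)

        warped = F.grid_sample(feat.unsqueeze(0), grid, align_corners=False)
        valid = ((u.abs() <= 1) & (v.abs() <= 1)).float()            # inside this frame's FOV
        acc += warped.squeeze(0) * valid
        weight += valid

    return acc / weight.clamp(min=1e-6)  # average where frames overlap
```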

Technical details

  • BEV map of 200 x 200 pixels at 0.25 m/pixel, covering 50 m x 50 m (see the quick check below)
  • The addition of dynamic classes helps with the static classes.
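
A quick consistency check of the grid numbers quoted above (illustrative only):

```python
# Sanity check of the BEV grid parameters (illustrative only).
bev_pixels = 200                   # 200 x 200 cells in the output map
resolution_m = 0.25                # metres per cell
print(bev_pixels * resolution_m)   # 50.0 -> the map covers 50 m x 50 m
```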

Notes

  • The evaluation is still in mIoU, treating the problem as semantic segmentation (see the sketch below). However, we should perhaps introduce instance segmentation for better prediction and planning.
  • Stitching may introduce some noise from extrinsics and pose estimation, and deep learning helps smooth this out.
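
For reference, a minimal sketch of the mIoU metric mentioned above, computed per class over BEV semantic maps (standard semantic-segmentation IoU, not code from the paper):

```python
# Minimal per-class IoU / mIoU sketch for BEV semantic maps (illustrative only).
import numpy as np

def mean_iou(pred, gt, num_classes):
    """pred, gt: integer class maps of shape (H, W)."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                 # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```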