monolayout.md

File metadata and controls

47 lines (38 loc) · 2.53 KB

June 2020

tl;dr: Predict BEV semantic maps from monocular images.

Overall impression

This is very similar to PyrOccNet.

MonoLayout uses self-generated ground truth, obtained by aggregating results across video frames (so-called temporal sensor fusion). HD map GT is used only for evaluation.

The authors also list tricks that did not work. I think this should be recommended standard practice in the future!

PYVA takes the discriminator-based adversarial training one step further to exploit the useful prior between vehicle and road layout.

Key ideas

  • View transformation: VAE-like, the latent feature is called "shared context"
  • Decoupled dynamic layout and static layout.
    • Dynamic layout: this is more related to mono 3D MOD.
      • Instance label
    • Static layout is more related to what Tesla is doing.
    • The network predicts the static or dynamic layout whether or not a region is visible to the camera. This is quite different from PyrOccNet, where occluded points are masked.
  • Architecture
    • One encoder, two decoders (dynamic + static)
      • The learned representation must implicitly disentangle the static parts and dynamic objects.
    • Patch-based discriminators
      • Plausible road geometries are extracted from an unpaired database of OpenStreetMap.
  • Generating training data via temporal sensor fusion
    • Use monodepth2 or lidar to lift RGB to point cloud.
    • With odometry info, aggregate and register the scene observations over time to generate a denser, noise-free point cloud.
    • When using monodepth2, discard anything more than 5 m away from the ego car, as those depths can be noisy.
    • Aggregate 40-50 frames.
    • Use GT or predicted semantic labels and aggregate into occupancy grid by majority voting.
  • Compared with pseudo-lidar, MonoLayout achieves equal or better results while being much faster.
  • This work can easily be extended into a behavior predictor.
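
The majority-voting step in the temporal sensor fusion bullets can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes the points from all frames are already registered into one ego frame (x lateral, z forward, in metres, ego car at the bottom centre of the grid), and the names `point_to_cell` and `aggregate_labels` are hypothetical. Only the 40 m extent and the 128-cell resolution come from the notes.

```python
from collections import Counter, defaultdict

GRID_SIZE_M = 40.0   # side length of the BEV area (from the notes)
GRID_CELLS = 128     # cells per side (from the notes)
CELL_M = GRID_SIZE_M / GRID_CELLS  # metres per cell


def point_to_cell(x, z):
    """Map an ego-frame point to (row, col), or None if it falls outside the grid.

    Assumed convention: x is lateral in [-20, 20), z is forward in [0, 40).
    """
    col = int((x + GRID_SIZE_M / 2) // CELL_M)
    row = int(z // CELL_M)
    if 0 <= row < GRID_CELLS and 0 <= col < GRID_CELLS:
        return row, col
    return None


def aggregate_labels(frames, unknown=-1):
    """Fuse per-frame labelled points into one grid by per-cell majority vote.

    frames: iterable of lists of (x, z, label) tuples, already registered
    to a common ego frame (the registration itself is not shown here).
    """
    votes = defaultdict(Counter)
    for points in frames:
        for x, z, label in points:
            cell = point_to_cell(x, z)
            if cell is not None:
                votes[cell][label] += 1

    grid = [[unknown] * GRID_CELLS for _ in range(GRID_CELLS)]
    for (row, col), counter in votes.items():
        # The most frequent label observed in this cell wins.
        grid[row][col] = counter.most_common(1)[0][0]
    return grid
```

Cells that never receive a point keep the `unknown` value, so sparse depth (for example after discarding distant monodepth2 points) simply leaves holes rather than wrong labels.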

Technical details

  • 40 x 40 m, 128 x 128 grid.
  • Realtime, 30Hz on GTX 1080 Ti.
  • Argoverse contains high-res semantic occupancy grid in BEV.
  • Things the authors tried but did not work
    • Using a single decoder to decode both dynamic and static layout.
  • Drawbacks: shadows can trick the network into predicting protrusions along the shadow direction.
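
For a sense of scale, the grid and frame-rate figures above translate into a per-cell resolution and a per-frame time budget. The short sketch below just does that arithmetic; the function names are made up for illustration.

```python
def cell_size_m(extent_m, cells):
    """Metres covered by one grid cell along one axis."""
    return extent_m / cells


def frame_budget_ms(rate_hz):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / rate_hz


# 40 m x 40 m area on a 128 x 128 grid, running at 30 Hz (figures from the notes)
print(cell_size_m(40.0, 128))          # 0.3125 m per cell
print(round(frame_budget_ms(30), 1))   # 33.3 ms per frame
```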

Notes