AI Summary
This work proposes the first end-to-end, fully self-supervised method for joint 3D occupancy and motion estimation in autonomous driving, removing the reliance on costly human annotations and external flow supervision. The model represents the scene with separate static and dynamic signed distance fields and learns scene dynamics implicitly through temporal feature aggregation, without any manual labels. It also introduces a strong self-supervised flow cue derived from the cosine similarity between features. The approach is evaluated on SemanticKITTI, KITTI-MOT, and nuScenes, advancing self-supervised modeling of dynamic 3D scenes.
Abstract
Estimating 3D occupancy and motion in the vehicle's surroundings is essential for autonomous driving, enabling situational awareness in dynamic environments. Existing approaches jointly learn geometry and motion but rely on expensive 3D occupancy and flow annotations, velocity labels from bounding boxes, or pretrained optical flow models. We propose a self-supervised method for 3D occupancy flow estimation that eliminates the need for human-produced annotations or external flow supervision. Our method disentangles the scene into separate static and dynamic signed distance fields and learns motion implicitly through temporal aggregation. Additionally, we introduce a strong self-supervised flow cue derived from cosine similarities between features. We demonstrate the efficacy of our 3D occupancy flow method on SemanticKITTI, KITTI-MOT, and nuScenes.
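The abstract does not say how the flow cue is computed from feature cosine similarities. As a rough illustration of the general idea only, the sketch below matches each location of one 2D feature map to its most similar location in the next frame within a small search window. The function name `cosine_flow_cue`, the `radius` parameter, the 2D image-plane setting, and the hard argmax matching are all assumptions made for this example, not the authors' formulation.

```python
import numpy as np


def cosine_flow_cue(feat_t, feat_t1, radius=2, eps=1e-8):
    """Hypothetical flow cue from feature cosine similarity.

    feat_t, feat_t1: (H, W, C) feature maps of two consecutive frames.
    Returns:
      flow: (H, W, 2) integer displacements (dy, dx), meaning that location
            (y, x) at time t best matches (y + dy, x + dx) at time t+1.
      best_sim: (H, W) cosine similarity of that best match.
    """
    H, W, _ = feat_t.shape

    # L2-normalise so that a dot product equals the cosine similarity.
    a = feat_t / (np.linalg.norm(feat_t, axis=-1, keepdims=True) + eps)
    b = feat_t1 / (np.linalg.norm(feat_t1, axis=-1, keepdims=True) + eps)

    # Zero-pad the second map: candidates outside the image get similarity 0.
    b_pad = np.pad(b, ((radius, radius), (radius, radius), (0, 0)))

    best_sim = np.full((H, W), -np.inf)
    flow = np.zeros((H, W, 2), dtype=np.int64)

    # Exhaustive search over a (2*radius + 1)^2 window of displacements.
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # shifted[y, x] == b[y + dy, x + dx] wherever that index is valid.
            shifted = b_pad[radius + dy: radius + dy + H,
                            radius + dx: radius + dx + W]
            sim = (a * shifted).sum(axis=-1)
            better = sim > best_sim
            best_sim = np.where(better, sim, best_sim)
            flow[better] = (dy, dx)

    return flow, best_sim
```

In a self-supervised setup, a cue of this kind would typically serve as a pseudo-target or regulariser for the predicted motion, possibly weighted by `best_sim` so that ambiguous matches contribute less. How the paper actually uses its cue is not stated in the abstract.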