🤖 AI Summary
3D semantic occupancy prediction relies heavily on costly, densely annotated 3D ground-truth labels, which limits scalability in autonomous driving. Method: This paper proposes MinkOcc, a multimodal (camera + LiDAR) semi-supervised framework trained in two stages: (1) a warm start on a small set of explicit 3D annotations; and (2) continued supervision from readily available accumulated LiDAR sweeps and RGB images, pseudo-labeled with vision foundation models. The model combines the two modalities through early fusion and uses sparse convolutional networks. Contribution/Results: The framework requires only a small fraction of 3D annotations for initialization. On SemanticKITTI, it reduces reliance on manual labeling by 90% while maintaining competitive accuracy and real-time inference capability. This label and compute efficiency is intended to ease deployment in real-world driving scenarios.
📝 Abstract
Developing 3D semantic occupancy prediction models often relies on dense 3D annotations for supervised learning, a process that is both labor- and resource-intensive, underscoring the need for label-efficient or even label-free approaches. To address this, we introduce MinkOcc, a multi-modal 3D semantic occupancy prediction framework for cameras and LiDARs that is trained with a two-step semi-supervised procedure. First, a small dataset of explicit 3D annotations warm-starts the training process; then, supervision continues with simpler-to-annotate accumulated LiDAR sweeps and images, semantically labelled by vision foundation models. MinkOcc effectively utilizes these sensor-rich supervisory cues and reduces reliance on manual labeling by 90% while maintaining competitive accuracy. In addition, the proposed model combines LiDAR and camera data through early fusion and leverages sparse convolutional networks for real-time prediction. With its efficiency in both supervision and computation, we aim to extend MinkOcc beyond curated datasets, enabling broader real-world deployment of 3D semantic occupancy prediction in autonomous driving.
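To make the second training step more concrete, the sketch below shows one plausible way accumulated, semantically labelled LiDAR points could be turned into sparse voxel-level pseudo-labels by majority vote. It is an illustration under stated assumptions, not the MinkOcc implementation: the function name `voxelize_pseudo_labels`, the voxel size, and the majority-vote rule are all hypothetical, and the per-point labels are assumed to have already been produced by a vision foundation model and transferred onto the points.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def voxelize_pseudo_labels(
    points: Sequence[Point],
    labels: Sequence[Hashable],
    voxel_size: float = 0.5,  # assumed grid resolution in metres (hypothetical)
) -> Dict[Voxel, Hashable]:
    """Aggregate per-point semantic labels into sparse voxel pseudo-labels.

    Only voxels that contain at least one point are stored, which mirrors
    the sparse representation that sparse convolutional networks consume.
    """
    if len(points) != len(labels):
        raise ValueError("points and labels must have the same length")

    # Collect label votes for every occupied voxel.
    votes: Dict[Voxel, Counter] = defaultdict(Counter)
    for (x, y, z), label in zip(points, labels):
        voxel = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        votes[voxel][label] += 1

    # Majority vote per voxel (assumed rule; ties go to the label seen first).
    return {voxel: counts.most_common(1)[0][0] for voxel, counts in votes.items()}


if __name__ == "__main__":
    # Toy accumulated sweep: three points fall in one voxel, one in another.
    pts = [(0.1, 0.1, 0.1), (0.2, 0.2, 0.2), (0.3, 0.3, 0.3), (1.2, 0.1, 0.1)]
    lbl = ["road", "road", "car", "car"]
    print(voxelize_pseudo_labels(pts, lbl))
```

Accumulating several sweeps before voxelizing raises the number of votes per voxel, which is one reason accumulated sweeps can act as a denser supervisory signal than a single scan.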