MinkOcc: Towards real-time label-efficient semantic occupancy prediction

📅 2025-04-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

3D semantic occupancy prediction relies heavily on costly, densely annotated 3D ground-truth labels, which limits scalability in autonomous driving. Method: This paper proposes a multimodal (camera + LiDAR) semi-supervised framework with two stages: (1) a small set of 3D annotations warm-starts training, after which vision foundation models generate cross-modal pseudo-labels from readily available accumulated LiDAR sweeps and RGB images; and (2) the model fuses the two modalities early and runs on sparse convolutional networks. Contribution/Results: The framework requires only a small fraction of 3D annotations for initialization. On SemanticKITTI, it reduces annotation effort by 90% while maintaining competitive accuracy and real-time inference capability, which improves practical deployability in real-world driving scenarios.

📝 Abstract

Developing 3D semantic occupancy prediction models often relies on dense 3D annotations for supervised learning, a process that is both labor- and resource-intensive, underscoring the need for label-efficient or even label-free approaches. To address this, we introduce MinkOcc, a multi-modal 3D semantic occupancy prediction framework for cameras and LiDARs that uses a two-step semi-supervised training procedure. First, a small dataset of explicit 3D annotations warm-starts the training process; then, supervision continues with simpler-to-annotate accumulated LiDAR sweeps and images, which are semantically labelled by vision foundation models. MinkOcc effectively utilizes these sensor-rich supervisory cues and reduces reliance on manual labeling by 90% while maintaining competitive accuracy. In addition, the proposed model incorporates information from LiDAR and camera data through early fusion and leverages sparse convolution networks for real-time prediction. With its efficiency in both supervision and computation, we aim to extend MinkOcc beyond curated datasets, enabling broader real-world deployment of 3D semantic occupancy prediction in autonomous driving.
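
The two-step procedure in the abstract can be pictured as a data split followed by a two-phase schedule. The sketch below is illustrative only and is not the authors' code: it assumes that "reducing manual labeling by 90%" corresponds to keeping roughly 10% of frames with dense 3D annotations, and the names `split_for_two_step_training`, `training_schedule`, and `labelled_fraction` are hypothetical.

```python
import random


def split_for_two_step_training(frame_ids, labelled_fraction=0.1, seed=0):
    """Split frames into a small annotated set and a larger pseudo-labelled set.

    Assumption: labelled_fraction=0.1 stands in for the 90% reduction in
    manual labelling; the paper's actual split protocol may differ.
    """
    rng = random.Random(seed)
    shuffled = list(frame_ids)
    rng.shuffle(shuffled)
    n_labelled = max(1, round(len(shuffled) * labelled_fraction))
    return shuffled[:n_labelled], shuffled[n_labelled:]


def training_schedule(frame_ids, labelled_fraction=0.1):
    """Return the two phases as (phase name, supervision source, frames)."""
    warm_start, pseudo = split_for_two_step_training(frame_ids, labelled_fraction)
    return [
        # Phase 1: dense 3D annotations warm-start the model.
        ("warm_start", "dense_3d_annotations", warm_start),
        # Phase 2: accumulated LiDAR sweeps and images, labelled by a
        # vision foundation model, continue the supervision.
        ("continued", "pseudo_labelled_lidar_and_images", pseudo),
    ]


schedule = training_schedule(range(100))
for phase, source, frames in schedule:
    print(phase, source, len(frames))
```

With 100 frames and the assumed 10% fraction, the first phase sees 10 annotated frames and the second phase sees the remaining 90 with pseudo-labels.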

Problem

Research questions and friction points this paper is trying to address.

Reducing reliance on dense 3D annotations for semantic occupancy prediction

Enabling label-efficient training using multi-modal sensor data

Achieving real-time 3D occupancy prediction for autonomous driving

Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-step semi-supervised training with sparse annotations

Early fusion of LiDAR and camera data

Sparse convolution for real-time prediction
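
As a rough illustration of the last two points, the sketch below attaches image colours to LiDAR points before any network runs (one simple form of early fusion) and stores only occupied voxels, which is the kind of input a sparse convolutional network operates on. It is a minimal pure-Python sketch under stated assumptions, not MinkOcc's pipeline: the pinhole camera model, the camera-frame point convention, the colour-averaging step, and the names `project_to_pixel` and `fuse_and_voxelize` are all hypothetical.

```python
import math
from collections import defaultdict


def project_to_pixel(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (x right, y down, z forward)."""
    x, y, z = point
    if z <= 0:
        return None  # point is behind the camera
    return math.floor(fx * x / z + cx), math.floor(fy * y / z + cy)


def fuse_and_voxelize(points, image, intrinsics, voxel_size=0.5):
    """Paint each visible point with its pixel colour, then bucket by voxel.

    Returns a dict mapping integer voxel coordinates to the mean colour of
    the points inside. Empty voxels are never stored, so memory scales with
    the number of occupied voxels rather than with the full grid.
    """
    fx, fy, cx, cy = intrinsics
    height, width = len(image), len(image[0])
    buckets = defaultdict(list)
    for point in points:
        pixel = project_to_pixel(point, fx, fy, cx, cy)
        if pixel is None:
            continue
        u, v = pixel
        if not (0 <= u < width and 0 <= v < height):
            continue  # projects outside the image
        voxel = tuple(int(c // voxel_size) for c in point)
        buckets[voxel].append(image[v][u])
    return {
        voxel: tuple(sum(channel) / len(colours) for channel in zip(*colours))
        for voxel, colours in buckets.items()
    }


# Toy 4x4 image where every pixel has the same RGB value.
image = [[(10, 20, 30) for _ in range(4)] for _ in range(4)]
points = [(0.0, 0.0, 1.0), (0.1, 0.1, 1.0), (0.0, 0.0, -1.0), (10.0, 0.0, 1.0)]
sparse_voxels = fuse_and_voxelize(points, image, intrinsics=(1.0, 1.0, 2.0, 2.0))
print(sparse_voxels)
```

In this toy input, the first two points fall in the same voxel, the third lies behind the camera, and the fourth projects outside the image, so a single occupied voxel remains.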

🔎 Similar Papers

OccRWKV: Rethinking Efficient 3D Semantic Occupancy Prediction with Linear Complexity