🤖 AI Summary
Addressing the challenge of balancing geometric accuracy and semantic richness in 3D semantic occupancy prediction for autonomous driving, this paper proposes a multi-stage LiDAR-camera fusion framework that combines middle-stage and late-stage fusion. Its key contributions are: (1) Gaussian geometry-aware rendering to enhance image features and improve depth consistency; (2) semantic-aware deformable cross-attention for fine-grained cross-modal alignment; and (3) adaptive voxel-weighted fusion coupled with a high-confidence voxel self-attention refinement module to strengthen small-object modeling and cross-modal consistency. Evaluated on nuScenes-OpenOccupancy, the method achieves 32.1% IoU and 25.3% mIoU, surpassing the state of the art by +0.7% IoU and +2.4% mIoU, with significant improvements in small-object detection and joint geometric-semantic modeling.
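To make contribution (3) concrete, here is a minimal sketch, assuming PyTorch, of what per-voxel adaptive weighting between LiDAR and camera voxel features could look like. The class name, gating design, and tensor shapes are illustrative assumptions, not the paper's actual AF module.

```python
# Hypothetical sketch of adaptive voxel-weighted fusion: a learned per-voxel
# weight blends LiDAR and camera voxel features (not the paper's code).
import torch
import torch.nn as nn

class AdaptiveVoxelFusion(nn.Module):
    """Gated fusion of two (B, C, X, Y, Z) voxel feature grids."""

    def __init__(self, channels: int):
        super().__init__()
        # Predict one fusion weight per voxel from the concatenated modalities.
        self.gate = nn.Sequential(
            nn.Conv3d(2 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, lidar_feat: torch.Tensor, cam_feat: torch.Tensor) -> torch.Tensor:
        # lidar_feat, cam_feat: (B, C, X, Y, Z)
        w = self.gate(torch.cat([lidar_feat, cam_feat], dim=1))  # (B, 1, X, Y, Z)
        # Voxels dominated by geometry lean on LiDAR; semantically rich ones on camera.
        return w * lidar_feat + (1.0 - w) * cam_feat

# Example: fuse two 128-channel features on a 100x100x8 voxel grid.
fusion = AdaptiveVoxelFusion(channels=128)
lidar = torch.randn(1, 128, 100, 100, 8)
camera = torch.randn(1, 128, 100, 100, 8)
fused = fusion(lidar, camera)  # (1, 128, 100, 100, 8)
```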
📝 Abstract
Accurate 3D semantic occupancy perception is essential for autonomous driving in complex environments with diverse and irregular objects. While vision-centric methods suffer from geometric inaccuracies, LiDAR-based approaches often lack rich semantic information. To address these limitations, this paper proposes MS-Occ, a novel multi-stage LiDAR-camera fusion framework that combines middle-stage feature fusion with late-stage voxel fusion, integrating LiDAR's geometric fidelity with camera-based semantic richness via hierarchical cross-modal fusion. The framework introduces innovations at two critical stages: (1) in the middle stage, the Gaussian-Geo module leverages Gaussian kernel rendering on sparse LiDAR depth maps to enhance 2D image features with dense geometric priors, and the Semantic-Aware module enriches LiDAR voxels with semantic context via deformable cross-attention; (2) in the late stage, the Adaptive Fusion (AF) module dynamically balances voxel features across modalities, while the High Classification Confidence Voxel Fusion (HCCVF) module resolves semantic inconsistencies through self-attention-based refinement. Experiments on the nuScenes-OpenOccupancy benchmark show that MS-Occ achieves an Intersection over Union (IoU) of 32.1% and a mean IoU (mIoU) of 25.3%, surpassing the state of the art by +0.7% IoU and +2.4% mIoU. Ablation studies further validate the contribution of each module, with substantial improvements in small-object perception, demonstrating the practical value of MS-Occ for safety-critical autonomous driving scenarios.
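The abstract describes the Gaussian-Geo module only at a high level. As a rough illustration of the underlying idea, the following is a minimal sketch, assuming PyTorch, of densifying a sparse LiDAR depth map with a normalized Gaussian kernel so it can serve as a dense geometric prior for 2D image features; the function name, kernel size, and sigma are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch (not the paper's code): densify a sparse LiDAR depth map
# by normalized Gaussian convolution, producing a dense depth prior.
import torch
import torch.nn.functional as F

def densify_sparse_depth(sparse_depth: torch.Tensor,
                         kernel_size: int = 7,
                         sigma: float = 2.0) -> torch.Tensor:
    """sparse_depth: (B, 1, H, W); zeros mark pixels with no LiDAR return."""
    # Build a normalized 2D Gaussian kernel.
    ax = torch.arange(kernel_size, dtype=sparse_depth.dtype,
                      device=sparse_depth.device) - (kernel_size - 1) / 2
    g1d = torch.exp(-0.5 * (ax / sigma) ** 2)
    kernel = g1d[:, None] * g1d[None, :]
    kernel = (kernel / kernel.sum()).view(1, 1, kernel_size, kernel_size)

    valid = (sparse_depth > 0).to(sparse_depth.dtype)
    pad = kernel_size // 2
    # Normalized convolution: each output pixel becomes a Gaussian-weighted
    # average of the valid depths in its neighborhood; pixels with no valid
    # neighbors stay zero.
    num = F.conv2d(sparse_depth * valid, kernel, padding=pad)
    den = F.conv2d(valid, kernel, padding=pad)
    dense = num / den.clamp(min=1e-6)
    # Keep the original LiDAR measurements where they exist.
    return torch.where(valid > 0, sparse_depth, dense)
```

A dense prior produced this way could then be injected into the image branch, for example by concatenating it with 2D feature maps, which is one plausible way to realize the "dense geometric priors" the abstract mentions.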