MS-Occ: Multi-Stage LiDAR-Camera Fusion for 3D Semantic Occupancy Prediction

📅 2025-04-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To balance geometric accuracy with semantic richness in 3D semantic occupancy prediction for autonomous driving, this paper proposes MS-Occ, a LiDAR-camera framework that fuses the two modalities collaboratively at both the middle (feature) and late (voxel) stages. The key contributions are: (1) Gaussian geometry-aware rendering that densifies sparse LiDAR depth to enhance image features and improve depth consistency; (2) semantic-aware deformable cross-attention for fine-grained cross-modal alignment; and (3) adaptive voxel-weighted fusion coupled with a high-confidence voxel self-attention refinement module that strengthens small-object modeling and cross-modal consistency. Evaluated on nuScenes-OpenOccupancy, the method achieves 32.1% IoU and 25.3% mIoU, surpassing the state of the art by +0.7% IoU and +2.4% mIoU, with significant gains in small-object detection and joint geometric-semantic modeling.
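The densification step behind the Gaussian geometry-aware rendering can be pictured as splatting each projected LiDAR return with a Gaussian kernel. Below is a minimal PyTorch sketch of that idea using normalized convolution; the kernel size, sigma, and the way the dense prior is injected into the 2D features are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (assumed parameters): densify a sparse LiDAR depth map by
# splatting each valid depth with a Gaussian kernel, i.e. normalized convolution.
import torch
import torch.nn.functional as F

def gaussian_kernel2d(size: int = 7, sigma: float = 2.0) -> torch.Tensor:
    """Build a (1, 1, size, size) Gaussian kernel."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-coords**2 / (2 * sigma**2))
    kernel = torch.outer(g, g)
    return (kernel / kernel.sum()).view(1, 1, size, size)

def render_dense_depth(sparse_depth: torch.Tensor, size: int = 7, sigma: float = 2.0) -> torch.Tensor:
    """sparse_depth: (B, 1, H, W), zeros where no LiDAR return was projected.
    Returns a dense depth prior of the same shape."""
    kernel = gaussian_kernel2d(size, sigma).to(sparse_depth.device)
    valid = (sparse_depth > 0).float()
    pad = size // 2
    # Splat depths and validity with the same Gaussian, then normalize so each
    # output pixel is a Gaussian-weighted average of nearby LiDAR depths.
    num = F.conv2d(sparse_depth * valid, kernel, padding=pad)
    den = F.conv2d(valid, kernel, padding=pad)
    return num / den.clamp(min=1e-6)

# Usage with a dummy ~5%-dense depth map at an assumed 256x704 feature resolution:
sparse = torch.rand(1, 1, 256, 704) * (torch.rand(1, 1, 256, 704) < 0.05)
dense_prior = render_dense_depth(sparse)  # can then be fused with 2D image features
```

Normalized convolution divides the splatted depths by the splatted validity mask, so sparsely covered regions borrow depth from nearby returns instead of being biased toward zero.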

📝 Abstract
Accurate 3D semantic occupancy perception is essential for autonomous driving in complex environments with diverse and irregular objects. While vision-centric methods suffer from geometric inaccuracies, LiDAR-based approaches often lack rich semantic information. To address these limitations, we propose MS-Occ, a novel multi-stage LiDAR-camera fusion framework that combines middle-stage feature fusion with late-stage voxel fusion, integrating LiDAR's geometric fidelity with camera-based semantic richness through hierarchical cross-modal fusion. The framework introduces innovations at two critical stages: (1) in the middle-stage feature fusion, the Gaussian-Geo module leverages Gaussian kernel rendering on sparse LiDAR depth maps to enhance 2D image features with dense geometric priors, and the Semantic-Aware module enriches LiDAR voxels with semantic context via deformable cross-attention; (2) in the late-stage voxel fusion, the Adaptive Fusion (AF) module dynamically balances voxel features across modalities, while the High Classification Confidence Voxel Fusion (HCCVF) module resolves semantic inconsistencies through self-attention-based refinement. Experiments on the nuScenes-OpenOccupancy benchmark show that MS-Occ achieves an Intersection over Union (IoU) of 32.1% and a mean IoU (mIoU) of 25.3%, surpassing the state of the art by +0.7% IoU and +2.4% mIoU. Ablation studies further validate the contribution of each module, with substantial improvements in small-object perception, demonstrating the practical value of MS-Occ for safety-critical autonomous driving scenarios.
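As a concrete picture of the Semantic-Aware module's mechanism, the sketch below implements a single-scale deformable cross-attention in PyTorch: each LiDAR voxel query samples image features at learned offsets around its projected reference point. The point count, offset scale, and the voxel-to-image projection are assumptions for illustration; the paper's actual (possibly multi-scale, multi-head) design may differ.

```python
# Minimal single-scale sketch (assumed design): voxel queries attend to image
# features around their projected reference points via learned sampling offsets.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableCrossAttention(nn.Module):
    def __init__(self, dim: int = 128, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offset = nn.Linear(dim, num_points * 2)   # (dx, dy) per sampling point
        self.weight = nn.Linear(dim, num_points)       # attention weight per point
        self.proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_pts, img_feat):
        """queries: (B, N, C) LiDAR voxel features; ref_pts: (B, N, 2) in [0, 1]
        (voxel centers projected onto the image); img_feat: (B, C, H, W)."""
        B, N, C = queries.shape
        # Small bounded offsets around each reference point (scale is assumed).
        offsets = self.offset(queries).view(B, N, self.num_points, 2).tanh() * 0.05
        weights = self.weight(queries).softmax(dim=-1)                # (B, N, P)
        loc = (ref_pts.unsqueeze(2) + offsets) * 2 - 1                # to [-1, 1]
        sampled = F.grid_sample(img_feat, loc, align_corners=False)  # (B, C, N, P)
        fused = (sampled * weights.unsqueeze(1)).sum(-1)             # (B, C, N)
        return queries + self.proj(fused.transpose(1, 2))            # residual update

# Usage: 1000 voxel queries attending into an assumed 1/8-scale image feature map.
attn = DeformableCrossAttention(dim=128)
out = attn(torch.randn(2, 1000, 128), torch.rand(2, 1000, 2), torch.randn(2, 128, 32, 88))
```

Sampling only a handful of learned locations per query keeps the cross-modal alignment cheap compared with dense attention over every image pixel, which is the usual motivation for deformable attention in this setting.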
Problem

Research questions and friction points this paper is trying to address.

Improves 3D semantic occupancy prediction for autonomous driving
Fuses LiDAR geometric accuracy with camera semantic richness
Addresses limitations in vision-centric and LiDAR-based methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-stage LiDAR-camera fusion framework
Gaussian-Geo module enhances image features
Adaptive Fusion dynamically balances voxel features across modalities (see the sketch after this list)
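As referenced above, here is a minimal sketch of what adaptive voxel-weighted fusion could look like: a learned per-voxel gate forms a convex combination of LiDAR and camera voxel features. The gating network, channel sizes, and Conv3d parameterization are illustrative assumptions about the AF module rather than its published definition.

```python
# Minimal sketch (assumed architecture): per-voxel gated fusion of two
# modality-specific voxel feature volumes.
import torch
import torch.nn as nn

class AdaptiveVoxelFusion(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        # Predict a per-voxel weight from the concatenated modalities.
        self.gate = nn.Sequential(
            nn.Conv3d(2 * dim, dim, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(dim, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, lidar_vox: torch.Tensor, cam_vox: torch.Tensor) -> torch.Tensor:
        """lidar_vox, cam_vox: (B, C, X, Y, Z) voxel feature volumes."""
        w = self.gate(torch.cat([lidar_vox, cam_vox], dim=1))  # (B, 1, X, Y, Z)
        # Lean on LiDAR geometry where it is reliable, camera semantics elsewhere.
        return w * lidar_vox + (1 - w) * cam_vox

# Usage with a small dummy volume:
fuse = AdaptiveVoxelFusion(dim=32)
out = fuse(torch.randn(1, 32, 16, 16, 8), torch.randn(1, 32, 16, 16, 8))
```

The sigmoid gate keeps the result a convex combination, so the fused volume never drifts outside the span of the two modality features; the paper's HCCVF module then further refines high-confidence voxels with self-attention.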
👥 Authors

Zhiqiang Wei
Xi'an Jiaotong University
OTFS, NOMA, Resource Allocation Design

Lianqing Zheng
Tongji University (Ph.D. student)
BEV/OCC, VLA, 4D Radar Perception, Multimodal Fusion, Data Closed-Loop

Jianan Liu
Unknown affiliation
Signal Processing, Deep Learning, Sensing and Perception, Autonomous Driving, Medical Imaging

Tao Huang
College of Science and Engineering, James Cook University, Cairns, QLD 4878, Australia

Qing-Long Han
School of Engineering, Swinburne University of Technology, Melbourne, VIC 3122, Australia

Wenwen Zhang
Associate Professor, Rutgers University
City Planning, Autonomous Vehicles, Shared Mobility, Energy, Urban Informatics

Fengdeng Zhang
School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China