🤖 AI Summary
This work addresses two key challenges in surround-view, camera-based 3D semantic occupancy prediction: geometric misalignment caused by inaccurate depth estimation, and severe spatial imbalance in the semantic distribution. To address these, the authors propose Dr.Occ, a framework that applies depth-guided 2D-to-3D view transformation for precise voxel alignment and introduces a region-wise Mixture-of-Experts (MoE) Transformer to adaptively model spatial semantic heterogeneity. High-quality dense depth maps from MoGe-2 provide the geometric priors that underpin the view transformation, improving occupancy prediction accuracy. Under the vision-only setting on Occ3D-nuScenes, Dr.Occ improves mIoU by 7.43% and IoU by 3.09% over the strong BEVDet4D baseline.
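The depth-guided view transformation described above rests on standard pinhole back-projection: each pixel, paired with a metric depth estimate, is lifted to a 3D point in the camera frame before voxelization. The sketch below illustrates only that lifting step, not the paper's actual implementation; the intrinsics values and pixel coordinates are hypothetical.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth d to a camera-frame 3D point
    via the pinhole model: x = (u - cx) * d / fx, y = (v - cy) * d / fy, z = d."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

# Hypothetical intrinsics and a pixel whose estimated depth is 10 m.
fx, fy, cx, cy = 1266.4, 1266.4, 816.3, 491.5
point = backproject(900.0, 500.0, 10.0, fx, fy, cx, cy)
```

Accurate depth is the whole game here: any error in `depth` scales the lateral offsets linearly and shifts the point along the viewing ray, which is exactly the misalignment the framework targets.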
📝 Abstract
3D semantic occupancy prediction is crucial for autonomous driving perception, offering comprehensive geometric scene understanding and semantic recognition. However, existing methods struggle with geometric misalignment during view transformation, due to the lack of accurate pixel-level depth estimation, and with severe spatial class imbalance, where semantic categories exhibit strong spatial anisotropy. To address these challenges, we propose Dr.Occ, a depth- and region-guided occupancy prediction framework. Specifically, we introduce a depth-guided 2D-to-3D View Transformer (D$^2$-VFormer) that leverages high-quality dense depth cues from MoGe-2 to construct reliable geometric priors, enabling precise geometric alignment of voxel features. Moreover, inspired by the Mixture-of-Experts (MoE) framework, we propose a region-guided Expert Transformer (R$^2$-EFormer) that adaptively allocates region-specific experts to different spatial regions, effectively handling spatial semantic variation. The two components are complementary: depth guidance ensures geometric alignment, while region experts strengthen semantic learning. Experiments on the Occ3D-nuScenes benchmark demonstrate that **Dr.Occ** improves the strong BEVDet4D baseline by 7.43% mIoU and 3.09% IoU under the full vision-only setting.
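The region-guided expert allocation in the abstract follows the usual soft MoE routing pattern: a gate produces per-expert weights from region-dependent logits, and the output is the gate-weighted sum of expert outputs. The sketch below shows that generic routing mechanism only; the scalar "experts" and the logit values are toy assumptions, not the paper's R$^2$-EFormer.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(feature, region_logits, experts):
    """Weight each expert's output by a softmax gate computed from
    region-dependent logits, then sum (standard soft MoE routing)."""
    gates = softmax(region_logits)
    outputs = [expert(feature) for expert in experts]
    return sum(g * o for g, o in zip(gates, outputs))

# Toy experts: each scales the scalar "feature" differently.
experts = [lambda f: 1.0 * f, lambda f: 2.0 * f, lambda f: 3.0 * f]
# A region whose (hypothetical) gate logits strongly prefer the third expert.
y = moe_forward(1.0, [0.0, 0.0, 5.0], experts)
```

Conditioning the gate on the spatial region, rather than on the token alone, is what lets different experts specialize on, e.g., near-field road surface versus sparse distant objects.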