🤖 AI Summary
This work addresses two key challenges in surround-view, camera-based 3D semantic occupancy prediction: geometric misalignment caused by inaccurate depth estimation, and severe spatial imbalance in the semantic distribution. To address these, the authors propose Dr.Occ, a framework that applies depth-guided 2D-to-3D view transformation for precise voxel alignment and introduces a region-wise Mixture-of-Experts (MoE) Transformer to adaptively model spatial semantic heterogeneity. High-quality dense depth maps from MoGe-2 provide the geometric priors that underpin the view transformation, improving occupancy prediction accuracy. Under the vision-only setting on Occ3D-nuScenes, Dr.Occ improves mIoU by 7.43% and IoU by 3.09% over the strong BEVDet4D baseline.
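The depth-guided view transformation described above rests on standard pinhole back-projection: each pixel, paired with a metric depth estimate, is lifted to a 3D point in the camera frame before voxelization. The sketch below illustrates only that lifting step, not the paper's actual implementation; the intrinsics values and pixel coordinates are hypothetical.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth d to a camera-frame 3D point
    via the pinhole model: x = (u - cx) * d / fx, y = (v - cy) * d / fy, z = d."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

# Hypothetical intrinsics and a pixel whose estimated depth is 10 m.
fx, fy, cx, cy = 1266.4, 1266.4, 816.3, 491.5
point = backproject(900.0, 500.0, 10.0, fx, fy, cx, cy)
```

Accurate depth is the whole game here: any error in `depth` scales the lateral offsets linearly and shifts the point along the viewing ray, which is exactly the misalignment the framework targets.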
📝 Abstract
3D semantic occupancy prediction is crucial for autonomous driving perception, offering comprehensive geometric scene understanding and semantic recognition. However, existing methods struggle with geometric misalignment during view transformation, due to the lack of accurate pixel-level depth estimation, and with severe spatial class imbalance, where semantic categories exhibit strong spatial anisotropy. To address these challenges, we propose Dr.Occ, a depth- and region-guided occupancy prediction framework. Specifically, we introduce a depth-guided 2D-to-3D View Transformer (D$^2$-VFormer) that leverages high-quality dense depth cues from MoGe-2 to construct reliable geometric priors, enabling precise geometric alignment of voxel features. Moreover, inspired by the Mixture-of-Experts (MoE) framework, we propose a region-guided Expert Transformer (R$^2$-EFormer) that adaptively allocates region-specific experts to different spatial regions, effectively handling spatial semantic variation. The two components are complementary: depth guidance ensures geometric alignment, while region experts strengthen semantic learning. Experiments on the Occ3D-nuScenes benchmark demonstrate that **Dr.Occ** improves the strong BEVDet4D baseline by 7.43% mIoU and 3.09% IoU under the full vision-only setting.
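The region-guided expert allocation in the abstract follows the usual soft MoE routing pattern: a gate produces per-expert weights from region-dependent logits, and the output is the gate-weighted sum of expert outputs. The sketch below shows that generic routing mechanism only; the scalar "experts" and the logit values are toy assumptions, not the paper's R$^2$-EFormer.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(feature, region_logits, experts):
    """Weight each expert's output by a softmax gate computed from
    region-dependent logits, then sum (standard soft MoE routing)."""
    gates = softmax(region_logits)
    outputs = [expert(feature) for expert in experts]
    return sum(g * o for g, o in zip(gates, outputs))

# Toy experts: each scales the scalar "feature" differently.
experts = [lambda f: 1.0 * f, lambda f: 2.0 * f, lambda f: 3.0 * f]
# A region whose (hypothetical) gate logits strongly prefer the third expert.
y = moe_forward(1.0, [0.0, 0.0, 5.0], experts)
```

Conditioning the gate on the spatial region, rather than on the token alone, is what lets different experts specialize on, e.g., near-field road surface versus sparse distant objects.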