🤖 AI Summary
Monocular 3D semantic scene completion (SSC) suffers from low voxel prediction density and weak occlusion modeling due to sparse 2D supervision. To address these issues, this paper proposes a high-dimensional semantic disentanglement and high-density occupancy optimization framework. Methodologically: (1) a pseudo-3D feature expansion module is designed to disentangle semantic and occlusion representations along the depth dimension; (2) a detection-refinement occupancy optimization architecture is introduced, integrating context-aware geometric-semantic modeling with detection-guided voxel completion and correction. Experiments on SemanticKITTI and SSCBench-KITTI-360 demonstrate substantial improvements in voxel-level occupancy density and semantic accuracy. Our method achieves state-of-the-art performance in both mIoU and Occ-mIoU, with particularly notable gains in distant and occluded regions. This advances robust, dense 3D scene understanding for autonomous driving.
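The pseudo-3D feature expansion along the depth dimension can be illustrated with a minimal LSS-style lifting sketch: a 2D feature map is weighted by a per-pixel depth distribution to produce a pseudo-3D volume. This is purely illustrative; the function name, the softmax-over-depth weighting, and all shapes are assumptions, not the paper's exact formulation of semantic/occlusion disentanglement.

```python
import numpy as np

def lift_2d_to_pseudo3d(feat_2d, depth_logits):
    """Lift a 2D feature map (C, H, W) to a pseudo-3D volume (C, D, H, W)
    by weighting features with a per-pixel depth distribution.

    Hypothetical sketch: the paper's module additionally decouples
    semantic and occlusion representations along this depth axis,
    which is not modeled here.
    """
    # Softmax over the depth bins D for each pixel.
    depth_prob = np.exp(depth_logits - depth_logits.max(axis=0, keepdims=True))
    depth_prob /= depth_prob.sum(axis=0, keepdims=True)
    # Outer product: (C, 1, H, W) * (1, D, H, W) -> (C, D, H, W)
    return feat_2d[:, None] * depth_prob[None]

C, D, H, W = 8, 16, 4, 6
feat = np.random.randn(C, H, W).astype(np.float32)
logits = np.random.randn(D, H, W).astype(np.float32)
vol = lift_2d_to_pseudo3d(feat, logits)
print(vol.shape)  # (8, 16, 4, 6)
```

Because the depth weights sum to one per pixel, summing the volume over the depth axis recovers the original 2D features, so no feature mass is lost in the expansion.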
📝 Abstract
Camera-based 3D semantic scene completion (SSC) plays a crucial role in autonomous driving, enabling voxelized 3D scene understanding for effective scene perception and decision-making. Existing SSC methods have proven effective at improving 3D scene representations, but suffer from an inherent input-output dimension gap and an annotation-reality density gap: the 2D planar view of the input images, paired with sparsely annotated labels, leads to inferior prediction of the dense occupancy of the real-world 3D scene. In light of this, we propose the High-Dimension High-Density Semantic Scene Completion (HD$^2$-SSC) framework with expanded pixel semantics and refined voxel occupancies. To bridge the dimension gap, a High-dimension Semantic Decoupling module is designed to expand 2D image features along a pseudo third dimension, decouple coarse pixel semantics from occlusions, and then identify focal regions with fine semantics to enrich the image features. To mitigate the density gap, a High-density Occupancy Refinement module is devised with a "detect-and-refine" architecture that leverages contextual geometric and semantic structures to enhance semantic density by completing missing voxels and correcting erroneous ones. Extensive experiments and analyses on the SemanticKITTI and SSCBench-KITTI-360 datasets validate the effectiveness of our HD$^2$-SSC framework.
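The "detect-and-refine" idea can be sketched as a toy two-stage pass over a voxel grid: first detect voxels whose occupancy probability is uncertain, then refine them from their spatial context. The thresholds, the 6-connected neighbor mean, and the function names below are assumptions standing in for the paper's context-aware geometric-semantic refinement.

```python
import numpy as np

def neighbor_mean(grid):
    """Mean of the 6-connected neighbors of each voxel (edge-padded)."""
    padded = np.pad(grid, 1, mode="edge")
    acc = np.zeros_like(grid, dtype=np.float64)
    for axis in range(3):
        for shift in (-1, 1):
            acc += np.roll(padded, shift, axis=axis)[1:-1, 1:-1, 1:-1]
    return acc / 6.0

def detect_and_refine(occ, low=0.3, high=0.7):
    """Detect voxels with uncertain occupancy (probability in (low, high))
    and replace them with the mean of their neighbors; confident voxels
    are left untouched. A toy stand-in for detection-guided voxel
    completion and correction."""
    uncertain = (occ > low) & (occ < high)
    refined = occ.copy()
    refined[uncertain] = neighbor_mean(occ)[uncertain]
    return refined, uncertain

# One uncertain voxel surrounded by confidently occupied ones.
occ = np.full((3, 3, 3), 0.9)
occ[1, 1, 1] = 0.5
refined, uncertain = detect_and_refine(occ)
print(refined[1, 1, 1])  # pulled toward its confident neighbors
```

Here the center voxel is the only one in the uncertain band, so it is rewritten from its six confident neighbors while all other voxels pass through unchanged.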