🤖 AI Summary
Monocular semantic scene completion (SSC) suffers from severe underestimation of distant geometric structures due to perspective distortion and occlusion. To address this, we propose ScanSSC, a novel end-to-end framework mapping monocular images to 3D semantic voxel grids. Its core contributions are: (1) a tri-axial voxel scanning mechanism that enhances distant voxels' awareness of near-field contextual cues; (2) axial near-to-far cascaded masked self-attention, enabling spatially selective feature modeling; and (3) Scan Loss, which accumulates logits along each axis to provide gradient-guided optimization for distant regions. Evaluated on SemanticKITTI and SSCBench-KITTI-360, ScanSSC achieves IoU scores of 44.54 and 48.29, and mIoU scores of 17.40 and 20.14, respectively, setting a new state of the art for camera-based SSC.
📝 Abstract
Camera-based Semantic Scene Completion (SSC) is gaining attention in the 3D perception field. However, perspective distortion and occlusion lead to the underestimation of geometry in distant regions, posing a critical issue for safety-focused autonomous driving systems. To tackle this, we propose ScanSSC, a novel camera-based SSC model composed of a Scan Module and Scan Loss, both designed to enhance distant scenes by leveraging context from near-viewpoint scenes. The Scan Module uses axis-wise masked attention, where each axis employs a near-to-far cascade mask that enables distant voxels to capture relationships with preceding voxels. In addition, the Scan Loss computes the cross-entropy along each axis between cumulative logits and the corresponding class distributions in a near-to-far direction, thereby propagating rich context-aware signals to distant voxels. Leveraging the synergy between these components, ScanSSC achieves state-of-the-art performance, with IoUs of 44.54 and 48.29, and mIoUs of 17.40 and 20.14, on the SemanticKITTI and SSCBench-KITTI-360 benchmarks, respectively.
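To make the two mechanisms described above concrete, here is a minimal 1-D NumPy sketch of (a) a near-to-far cascade attention mask and (b) a cumulative-logit cross-entropy in the spirit of Scan Loss. This is an illustrative simplification, not the paper's implementation: the actual model operates on 3-D voxel grids along three axes, and the function names, the single-axis setting, and the uniform normalization of the cumulative label distribution are assumptions made for this sketch.

```python
import numpy as np

def near_to_far_mask(n):
    """Cascade mask along one scanned axis (sketch): position i (farther)
    may attend to positions 0..i (nearer). Assumes index 0 is nearest."""
    return np.tril(np.ones((n, n), dtype=bool))

def scan_loss_1d(logits, labels):
    """Toy 1-D analogue of Scan Loss: cross-entropy between cumulative
    logits and cumulative (normalized) one-hot label distributions,
    accumulated in the near-to-far direction.
    logits: (n, c) array; labels: (n,) int class ids."""
    n, c = logits.shape
    cum_logits = np.cumsum(logits, axis=0)            # accumulate near -> far
    one_hot = np.eye(c)[labels]
    cum_dist = np.cumsum(one_hot, axis=0)
    cum_dist /= cum_dist.sum(axis=1, keepdims=True)   # prefix class distribution
    # numerically stable log-softmax over classes for each prefix
    z = cum_logits - cum_logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-(cum_dist * log_p).sum(axis=1).mean())
```

Because each prefix sum includes all nearer positions, the loss term at a distant position receives gradients that depend on near-field predictions, which is the intuition behind propagating near-viewpoint context to distant voxels.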