🤖 AI Summary
This work addresses structural instability in existing monocular 3D semantic scene completion (SSC) methods, which stems from the lack of explicit modeling of voxel feature reliability and from poorly regulated cross-scale information propagation, leading to projection diffusion and feature entanglement. Building upon the MonoScene framework, we propose a parameter-free adaptive multi-scale channel-spatial parallel attention mechanism to calibrate voxel feature reliability, along with a hierarchical adaptive gating strategy to stabilize multi-scale feature fusion between the encoder and decoder. Evaluated on the NYUv2 benchmark, our approach achieves an SSC mIoU of 27.25% (+0.31 over MonoScene) and a scene completion (SC) IoU of 43.10% (+0.59 over MonoScene), and it runs stably on an NVIDIA Jetson embedded platform.
📝 Abstract
In indoor assistive perception for visually impaired users, 3D Semantic Scene Completion (SSC) is expected to provide structurally coherent and semantically consistent occupancy under strictly monocular vision for safety-critical scene understanding. However, existing monocular SSC approaches often lack explicit modeling of voxel-feature reliability and regulated cross-scale information propagation during 2D-3D projection and multi-scale fusion, making them vulnerable to projection diffusion and feature entanglement and thus limiting structural stability. To address these challenges, this paper presents an Adaptive Multi-scale Attention Aggregation (AMAA) framework built upon the MonoScene pipeline. Rather than introducing a heavier backbone, AMAA focuses on reliability-oriented feature regulation within a monocular SSC framework. Specifically, lifted voxel features are jointly calibrated in semantic and spatial dimensions through parallel channel-spatial attention aggregation, while multi-scale encoder-decoder fusion is stabilized via a hierarchical adaptive feature-gating strategy that regulates information injection across scales. Experiments on the NYUv2 benchmark demonstrate consistent improvements over MonoScene without significantly increasing system complexity: AMAA achieves 27.25% SSC mIoU (+0.31) and 43.10% SC IoU (+0.59). In addition, system-level deployment on an NVIDIA Jetson platform verifies that the complete AMAA framework can be executed stably on embedded hardware. Overall, AMAA improves monocular SSC quality and provides a reliable and deployable perception framework for indoor assistive systems targeting visually impaired users.
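The abstract does not specify the internals of the attention or gating modules, so the following is only a minimal NumPy sketch of the two general ideas it names: reweighting lifted voxel features with channel and spatial attention computed in parallel, and gating how much encoder information is injected into a decoder feature. The function names, the pooling choices, and the sigmoid-based weights are all assumptions made for illustration and should not be read as the AMAA implementation.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function, mapping values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def parallel_channel_spatial_attention(voxels):
    """Reweight a voxel feature grid of shape (C, X, Y, Z).

    Hypothetical, weight-free variant: both branches read the same
    input (hence "parallel") and their weights are multiplied together.
    """
    # Channel branch: global average pooling gives one score per channel.
    channel_desc = voxels.mean(axis=(1, 2, 3))               # shape (C,)
    channel_w = sigmoid(channel_desc - channel_desc.mean())  # in (0, 1)

    # Spatial branch: averaging over channels gives one score per voxel.
    spatial_desc = voxels.mean(axis=0)                       # shape (X, Y, Z)
    spatial_w = sigmoid(spatial_desc - spatial_desc.mean())  # in (0, 1)

    # Broadcast both weight sets back onto the (C, X, Y, Z) grid.
    return voxels * channel_w[:, None, None, None] * spatial_w[None]


def gated_fusion(decoder_feat, encoder_feat):
    """Inject encoder features into decoder features through a gate.

    Assumed gate: a per-voxel value in (0, 1) derived from the agreement
    between the two feature maps, shared across channels.
    """
    agreement = (decoder_feat * encoder_feat).mean(axis=0, keepdims=True)
    gate = sigmoid(agreement)                                # shape (1, X, Y, Z)
    return decoder_feat + gate * encoder_feat


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lifted = rng.normal(size=(8, 6, 6, 4))   # toy lifted voxel features
    skip = rng.normal(size=(8, 6, 6, 4))     # toy encoder skip features
    calibrated = parallel_channel_spatial_attention(lifted)
    fused = gated_fusion(calibrated, skip)
    print(calibrated.shape, fused.shape)
```

Because every weight lies strictly between 0 and 1, the attention step can only attenuate features and the gate can only inject a fraction of the encoder signal. This is one simple way to read "reliability-oriented feature regulation", under the assumptions stated above.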