Monocular Semantic Scene Completion via Masked Recurrent Networks

📅 2025-07-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Monocular Semantic Scene Completion (MSSC) aims to jointly predict voxel-level occupancy and semantic labels from a single RGB image, yet existing methods suffer from coupled optimization between visible-region segmentation and occluded-region hallucination, and are highly sensitive to depth estimation errors. This paper proposes a two-stage framework: the first stage generates coarse scene completion; the second stage introduces a Masked Sparse Gated Recurrent Unit (GRU) and a distance-aware attentional projection mechanism to dynamically focus computation on occupied voxels and suppress projection distortions on sparse voxel grids. Coupled with an iterative visibility mask update strategy, the method significantly improves geometric and semantic reconstruction accuracy in occluded regions under complex scenes. Our approach achieves state-of-the-art performance on NYUv2 and SemanticKITTI benchmarks, generalizes across indoor and outdoor scenarios, and the source code is publicly available.
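The core of the second stage is a recurrent unit that updates hidden states only at occupied voxels, skipping empty space. A minimal NumPy sketch of such a masked, sparse GRU step is shown below; all function and parameter names here are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def masked_gru_step(h, x, mask, Wz, Uz, Wr, Ur, Wh, Uh):
    """One recurrent update applied only where mask == 1 (occupied voxels).

    h, x : (N, D) hidden states and input features for N voxels
    mask : (N,) binary occupancy mask; voxels outside it keep their state
    W*, U* : (D, D) input / recurrent weight matrices
    """
    occ = mask.astype(bool)
    h_new = h.copy()
    # Sparse update: compute gates only on the occupied subset of voxels
    hs, xs = h[occ], x[occ]
    z = sigmoid(xs @ Wz + hs @ Uz)               # update gate
    r = sigmoid(xs @ Wr + hs @ Ur)               # reset gate
    h_tilde = np.tanh(xs @ Wh + (r * hs) @ Uh)   # candidate state
    h_new[occ] = (1 - z) * hs + z * h_tilde
    return h_new
```

Restricting the gate computation to the occupied subset is what makes the recurrence "sparse": cost scales with the number of occupied voxels, not the full grid.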

📝 Abstract
Monocular Semantic Scene Completion (MSSC) aims to predict the voxel-wise occupancy and semantic category from a single-view RGB image. Existing methods adopt a single-stage framework that aims to simultaneously achieve visible region segmentation and occluded region hallucination, while also being affected by inaccurate depth estimation. Such methods often achieve suboptimal performance, especially in complex scenes. We propose a novel two-stage framework that decomposes MSSC into coarse MSSC followed by the Masked Recurrent Network. Specifically, we propose the Masked Sparse Gated Recurrent Unit (MS-GRU) which concentrates on the occupied regions by the proposed mask updating mechanism, and a sparse GRU design is proposed to reduce the computation cost. Additionally, we propose the distance attention projection to reduce projection errors by assigning different attention scores according to the distance to the observed surface. Experimental results demonstrate that our proposed unified framework, MonoMRN, effectively supports both indoor and outdoor scenes and achieves state-of-the-art performance on the NYUv2 and SemanticKITTI datasets. Furthermore, we conduct robustness analysis under various disturbances, highlighting the role of the Masked Recurrent Network in enhancing the model's resilience to such challenges. The source code is publicly available.
Problem

Research questions and friction points this paper is trying to address.

Predict voxel occupancy and semantics from single RGB images
Mitigate depth-estimation errors, especially in complex scenes
Enhance robustness to disturbances in indoor and outdoor scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage framework for monocular semantic scene completion
Masked Sparse GRU focuses on occupied regions efficiently
Distance attention projection reduces surface projection errors
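The distance attention projection can be pictured as weighting the voxels along a camera ray by their distance to the observed surface, so that depth errors smear features less. The sketch below uses a simple softmax-over-negative-distance stand-in; the names and the exact weighting are assumptions, not necessarily the paper's formulation:

```python
import numpy as np

def distance_attention_weights(dist, tau=1.0):
    """Attention weights that decay with distance to the observed surface.

    dist : (K,) absolute distances of K candidate voxels along a camera ray
    tau  : temperature; smaller tau concentrates weight near the surface
    Returns softmax weights favouring voxels close to the surface.
    """
    logits = -np.abs(dist) / tau
    logits -= logits.max()          # numerical stability before exponentiation
    w = np.exp(logits)
    return w / w.sum()

def project_feature(feat, dist, tau=1.0):
    """Distribute a (D,)-dim pixel feature over K voxels along its ray,
    weighted by distance-aware attention."""
    w = distance_attention_weights(dist, tau)
    return w[:, None] * feat[None, :]   # (K, D) weighted voxel features
```

Because the weights sum to one, the pixel feature is conserved along the ray while being concentrated near the surface rather than assigned to a single, possibly mis-estimated, depth.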
Xuzhi Wang
Tianjin Normal University
Xinran Wu
Tianjin Normal University
Song Wang
Zhejiang University
Lingdong Kong
National University of Singapore
Computer Vision, Deep Learning
Ziping Zhao
Tianjin Normal University
Affective Computing