Monocular Semantic Scene Completion via Masked Recurrent Networks

📅 2025-07-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Monocular Semantic Scene Completion (MSSC) aims to jointly predict voxel-level occupancy and semantic labels from a single RGB image, yet existing methods suffer from coupled optimization between visible-region segmentation and occluded-region hallucination, and are highly sensitive to depth estimation errors. This paper proposes a two-stage framework: the first stage generates coarse scene completion; the second stage introduces a Masked Sparse Gated Recurrent Unit (GRU) and a distance-aware attentional projection mechanism to dynamically focus computation on occupied voxels and suppress projection distortions on sparse voxel grids. Coupled with an iterative visibility mask update strategy, the method significantly improves geometric and semantic reconstruction accuracy in occluded regions under complex scenes. Our approach achieves state-of-the-art performance on NYUv2 and SemanticKITTI benchmarks, generalizes across indoor and outdoor scenarios, and the source code is publicly available.
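The core of the second stage is a recurrent unit that updates hidden states only at occupied voxels, skipping empty space. A minimal NumPy sketch of such a masked, sparse GRU step is shown below; all function and parameter names here are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def masked_gru_step(h, x, mask, Wz, Uz, Wr, Ur, Wh, Uh):
    """One recurrent update applied only where mask == 1 (occupied voxels).

    h, x : (N, D) hidden states and input features for N voxels
    mask : (N,) binary occupancy mask; voxels outside it keep their state
    W*, U* : (D, D) input / recurrent weight matrices
    """
    occ = mask.astype(bool)
    h_new = h.copy()
    # Sparse update: compute gates only on the occupied subset of voxels
    hs, xs = h[occ], x[occ]
    z = sigmoid(xs @ Wz + hs @ Uz)               # update gate
    r = sigmoid(xs @ Wr + hs @ Ur)               # reset gate
    h_tilde = np.tanh(xs @ Wh + (r * hs) @ Uh)   # candidate state
    h_new[occ] = (1 - z) * hs + z * h_tilde
    return h_new
```

Restricting the gate computation to the occupied subset is what makes the recurrence "sparse": cost scales with the number of occupied voxels, not the full grid.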

📝 Abstract
Monocular Semantic Scene Completion (MSSC) aims to predict the voxel-wise occupancy and semantic category from a single-view RGB image. Existing methods adopt a single-stage framework that aims to simultaneously achieve visible region segmentation and occluded region hallucination, while also being affected by inaccurate depth estimation. Such methods often achieve suboptimal performance, especially in complex scenes. We propose a novel two-stage framework that decomposes MSSC into coarse MSSC followed by the Masked Recurrent Network. Specifically, we propose the Masked Sparse Gated Recurrent Unit (MS-GRU) which concentrates on the occupied regions by the proposed mask updating mechanism, and a sparse GRU design is proposed to reduce the computation cost. Additionally, we propose the distance attention projection to reduce projection errors by assigning different attention scores according to the distance to the observed surface. Experimental results demonstrate that our proposed unified framework, MonoMRN, effectively supports both indoor and outdoor scenes and achieves state-of-the-art performance on the NYUv2 and SemanticKITTI datasets. Furthermore, we conduct robustness analysis under various disturbances, highlighting the role of the Masked Recurrent Network in enhancing the model's resilience to such challenges. The source code is publicly available.
Problem

Research questions and friction points this paper is trying to address.

Predict voxel occupancy and semantics from single RGB images
Mitigate depth-estimation errors, especially in complex scenes
Enhance robustness to disturbances in indoor and outdoor scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage framework for monocular semantic scene completion
Masked Sparse GRU focuses on occupied regions efficiently
Distance attention projection reduces surface projection errors
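The distance attention projection can be pictured as weighting the voxels along a camera ray by their distance to the observed surface, so that depth errors smear features less. The sketch below uses a simple softmax-over-negative-distance stand-in; the names and the exact weighting are assumptions, not necessarily the paper's formulation:

```python
import numpy as np

def distance_attention_weights(dist, tau=1.0):
    """Attention weights that decay with distance to the observed surface.

    dist : (K,) absolute distances of K candidate voxels along a camera ray
    tau  : temperature; smaller tau concentrates weight near the surface
    Returns softmax weights favouring voxels close to the surface.
    """
    logits = -np.abs(dist) / tau
    logits -= logits.max()          # numerical stability before exponentiation
    w = np.exp(logits)
    return w / w.sum()

def project_feature(feat, dist, tau=1.0):
    """Distribute a (D,)-dim pixel feature over K voxels along its ray,
    weighted by distance-aware attention."""
    w = distance_attention_weights(dist, tau)
    return w[:, None] * feat[None, :]   # (K, D) weighted voxel features
```

Because the weights sum to one, the pixel feature is conserved along the ray while being concentrated near the surface rather than assigned to a single, possibly mis-estimated, depth.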
Xuzhi Wang
Tianjin Normal University
Xinran Wu
Tianjin Normal University
Song Wang
Zhejiang University
Lingdong Kong
National University of Singapore
Computer Vision, Deep Learning
Ziping Zhao
Tianjin Normal University
Affective Computing