SGR-OCC: Evolving Monocular Priors for Embodied 3D Occupancy Prediction via Soft-Gating Lifting and Semantic-Adaptive Geometric Refinement

📅 2026-03-14
🤖 AI Summary
This work proposes SGR-OCC, a novel framework for 3D semantic occupancy prediction from monocular video, addressing two key challenges: boundary “feature leakage” caused by depth ambiguity and the disruption of spatial priors due to “cold-start” issues in temporal fusion modules. Inspired by the principle of “inheritance and evolution,” the method introduces a soft-gating mechanism to model depth uncertainty, effectively suppressing background noise, and simplifies 3D displacement refinement to a one-dimensional depth correction along camera rays. To mitigate cold-start effects, a two-stage progressive training strategy with identity initialization is employed. The approach achieves state-of-the-art performance on EmbodiedOcc-ScanNet and Occ-ScanNet, attaining a completion IoU of 58.55% and a semantic mIoU of 49.89% on local tasks.
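The soft-gating idea can be illustrated with a minimal sketch. It assumes the gate is a Gaussian centered on a predicted depth `mu` with spread `sigma`, evaluated at a set of candidate depths along one camera ray. The function names, the per-ray list layout, and the exact gate formula are illustrative assumptions, not the paper's implementation.

```python
import math

def gaussian_gate(depth_bins, mu, sigma):
    """Weight in (0, 1] for each candidate depth along one ray (assumed form)."""
    return [math.exp(-0.5 * ((d - mu) / sigma) ** 2) for d in depth_bins]

def lift_feature(feature, depth_bins, mu, sigma):
    """Copy a 2D pixel feature to every depth bin, scaled by its gate weight."""
    gates = gaussian_gate(depth_bins, mu, sigma)
    return [[g * f for f in feature] for g in gates]

# Hypothetical ray: four candidate depths, predicted surface at 2.0 m.
bins = [1.0, 2.0, 3.0, 4.0]
lifted = lift_feature([1.0, 2.0], bins, mu=2.0, sigma=0.5)
```

A small `sigma` concentrates the feature near the predicted surface, while a large `sigma` spreads it across more depth bins, which is one way uncertain depth estimates could be prevented from writing confident features into background voxels.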

📝 Abstract
3D semantic occupancy prediction is a cornerstone of embodied AI, enabling agents to perceive dense scene geometry and semantics incrementally from monocular video streams. However, current online frameworks face two critical bottlenecks: the inherent depth ambiguity of monocular estimation, which causes "feature bleeding" at object boundaries, and the "cold start" instability in which uninitialized temporal fusion layers distort high-quality spatial priors during early training stages. In this paper, we propose SGR-OCC (Soft-Gating and Ray-refinement Occupancy), a unified framework driven by the philosophy of "Inheritance and Evolution". To inherit monocular spatial expertise, we introduce a Soft-Gating Feature Lifter that explicitly models depth uncertainty via a Gaussian gate to probabilistically suppress background noise. Furthermore, a Dynamic Ray-Constrained Anchor Refinement module reduces complex 3D displacement searches to efficient 1D depth corrections along camera rays, ensuring sub-voxel adherence to physical surfaces. To ensure stable evolution toward temporal consistency, we employ a Two-Phase Progressive Training Strategy equipped with identity-initialized fusion, which resolves the cold start problem and shields spatial priors from noisy early gradients. Extensive experiments on the EmbodiedOcc-ScanNet and Occ-ScanNet benchmarks demonstrate that SGR-OCC achieves state-of-the-art performance. In local prediction tasks, SGR-OCC achieves a completion IoU of 58.55% and a semantic mIoU of 49.89%, surpassing the previous best method, EmbodiedOcc++, by 3.65% and 3.69%, respectively. In challenging embodied prediction tasks, our model reaches 55.72% SC-IoU and 46.22% mIoU. Qualitative results further confirm our model's superior capability in preserving structural integrity and boundary sharpness in complex indoor environments.
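The reduction from a 3D displacement to a 1D depth correction can be sketched as follows. The sketch assumes an anchor is moved only along the ray from the camera center through its current position, so the refinement is a single scalar `delta_depth` rather than a free 3D offset. The function name and arguments are hypothetical, and how the correction is predicted is left out.

```python
import math

def refine_along_ray(origin, anchor, delta_depth):
    """Move a 3D anchor by delta_depth along the camera ray passing through it."""
    offset = [a - o for a, o in zip(anchor, origin)]
    depth = math.sqrt(sum(c * c for c in offset))   # current distance from camera
    direction = [c / depth for c in offset]         # unit ray direction
    new_depth = depth + delta_depth                 # the only degree of freedom
    return [o + new_depth * r for o, r in zip(origin, direction)]

# Hypothetical anchor 5 m from a camera at the origin, pulled 1 m closer.
refined = refine_along_ray([0.0, 0.0, 0.0], [3.0, 0.0, 4.0], -1.0)
```

Because the ray direction is fixed by the camera geometry, the refined anchor always stays on the pixel's line of sight, and the search space shrinks from three unknowns to one.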
Problem

Research questions and friction points this paper is trying to address.

3D semantic occupancy prediction
monocular depth ambiguity
feature bleeding
cold start instability
embodied AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Soft-Gating Feature Lifting
Ray-Constrained Geometric Refinement
Temporal Fusion Cold Start
Monocular 3D Occupancy Prediction
Progressive Training Strategy
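One way to picture identity-initialized fusion is a gated residual blend whose gate starts at zero, so that the fused output equals the spatial prior before any temporal training has happened. This is a sketch under that assumption only: the paper's fusion module may be structured differently, and `fuse` and `gate` are hypothetical names.

```python
def fuse(spatial, temporal, gate=0.0):
    """Blend temporal features into spatial ones; gate=0.0 is the identity on spatial."""
    return [s + gate * (t - s) for s, t in zip(spatial, temporal)]

spatial_prior = [1.0, 2.0]
temporal_feat = [5.0, -3.0]

at_init = fuse(spatial_prior, temporal_feat)             # gate starts at 0.0
after_training = fuse(spatial_prior, temporal_feat, 1.0) # gate fully open
```

Starting from the identity means the first temporal training steps cannot degrade the inherited spatial prediction, since the temporal branch only contributes as the gate moves away from zero.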