🤖 AI Summary
This work proposes SGR-OCC, a novel framework for 3D semantic occupancy prediction from monocular video, addressing two key challenges: boundary “feature leakage” caused by depth ambiguity and the disruption of spatial priors due to “cold-start” issues in temporal fusion modules. Inspired by the principle of “inheritance and evolution,” the method introduces a soft-gating mechanism to model depth uncertainty, effectively suppressing background noise, and simplifies 3D displacement refinement to a one-dimensional depth correction along camera rays. To mitigate cold-start effects, a two-stage progressive training strategy with identity initialization is employed. The approach achieves state-of-the-art performance on EmbodiedOcc-ScanNet and Occ-ScanNet, attaining a completion IoU of 58.55% and a semantic mIoU of 49.89% on local tasks.
📝 Abstract
3D semantic occupancy prediction is a cornerstone for embodied AI, enabling agents to perceive dense scene geometry and semantics incrementally from monocular video streams. However, current online frameworks face two critical bottlenecks: the inherent depth ambiguity of monocular estimation, which causes "feature bleeding" at object boundaries, and the "cold start" instability, where uninitialized temporal fusion layers distort high-quality spatial priors during early training stages. In this paper, we propose SGR-OCC (Soft-Gating and Ray-refinement Occupancy), a unified framework driven by the philosophy of "Inheritance and Evolution". To inherit monocular spatial expertise, we introduce a Soft-Gating Feature Lifter that explicitly models depth uncertainty via a Gaussian gate to probabilistically suppress background noise. Furthermore, a Dynamic Ray-Constrained Anchor Refinement module simplifies complex 3D displacement searches into efficient 1D depth corrections along camera rays, ensuring sub-voxel adherence to physical surfaces. To ensure stable evolution toward temporal consistency, we employ a Two-Phase Progressive Training Strategy equipped with identity-initialized fusion, which resolves the cold start problem and shields spatial priors from noisy early gradients. Extensive experiments on the EmbodiedOcc-ScanNet and Occ-ScanNet benchmarks demonstrate that SGR-OCC achieves state-of-the-art performance. In local prediction tasks, SGR-OCC achieves a completion IoU of 58.55% and a semantic mIoU of 49.89%, surpassing the previous best method, EmbodiedOcc++, by 3.65% and 3.69%, respectively. In challenging embodied prediction tasks, our model reaches 55.72% SC-IoU and 46.22% mIoU. Qualitative results further confirm our model's superior capability in preserving structural integrity and boundary sharpness in complex indoor environments.
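The three mechanisms named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the fixed `sigma`, and the additive linear form of the fusion are all assumptions chosen for illustration, since the abstract does not specify these details.

```python
import numpy as np


def gaussian_depth_gate(sample_depths, pred_depth, sigma):
    """Soft gate in (0, 1] for samples along one camera ray (assumed form).

    The weight peaks at the predicted depth and decays with distance from it,
    so features lifted far behind or in front of the surface are damped.
    """
    return np.exp(-0.5 * ((sample_depths - pred_depth) / sigma) ** 2)


def refine_anchor_along_ray(cam_origin, ray_dir, depth, delta_depth):
    """Move a 3D anchor by a scalar depth offset along its camera ray.

    Only one number (delta_depth) is corrected instead of a free 3D
    displacement, which keeps the anchor on the ray by construction.
    """
    unit_dir = ray_dir / np.linalg.norm(ray_dir)
    return cam_origin + (depth + delta_depth) * unit_dir


def fuse(spatial_feat, temporal_feat, weight):
    """Residual temporal fusion (hypothetical form).

    With `weight` initialized to zeros, the output equals `spatial_feat`,
    so the fusion starts as an identity on the spatial prior.
    """
    return spatial_feat + weight @ temporal_feat


# Usage: gate three depth samples around a predicted depth of 2.0 m.
gate = gaussian_depth_gate(np.array([1.0, 2.0, 3.0]), pred_depth=2.0, sigma=0.5)

# Usage: shift an anchor at depth 1.5 m by +0.25 m along the optical axis.
anchor = refine_anchor_along_ray(
    np.zeros(3), np.array([0.0, 0.0, 2.0]), depth=1.5, delta_depth=0.25
)

# Usage: zero-initialized fusion leaves the spatial feature unchanged.
fused = fuse(np.arange(4.0), np.ones(4), np.zeros((4, 4)))
```

Starting the fusion weights at zero is one simple way to realize an identity initialization: the temporal branch contributes nothing at first and only gradually departs from the spatial prior as training updates the weights.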