SGR-OCC: Evolving Monocular Priors for Embodied 3D Occupancy Prediction via Soft-Gating Lifting and Semantic-Adaptive Geometric Refinement

📅 2026-03-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes SGR-OCC, a framework for 3D semantic occupancy prediction from monocular video that addresses two key challenges: "feature bleeding" at object boundaries caused by depth ambiguity, and the disruption of spatial priors caused by "cold start" instability in temporal fusion modules. Guided by the principle of "inheritance and evolution," the method introduces a soft-gating mechanism that models depth uncertainty to suppress background noise, and it reduces 3D displacement refinement to a one-dimensional depth correction along camera rays. To mitigate cold-start effects, it uses a two-phase progressive training strategy with identity-initialized fusion. The approach achieves state-of-the-art performance on EmbodiedOcc-ScanNet and Occ-ScanNet, attaining a completion IoU of 58.55% and a semantic mIoU of 49.89% on local prediction tasks.
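The identity-initialization idea can be illustrated with a small sketch. This is not the paper's implementation: it assumes, purely for illustration, that fusion is a single linear map over the concatenated current-frame and history features, and the names `identity_init_weights` and `fuse` are hypothetical. With the weights set to identity on the current-frame block and zero on the history block, the fusion layer initially passes the spatial features through unchanged, so early training cannot distort them.

```python
def identity_init_weights(dim):
    """Hypothetical helper: a (dim x 2*dim) weight matrix that is the
    identity on the current-frame block and zero on the history block."""
    return [[1.0 if j == i else 0.0 for j in range(2 * dim)]
            for i in range(dim)]


def fuse(weights, current, history):
    """Linear fusion over the concatenation [current, history] (assumed form)."""
    x = list(current) + list(history)
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


# At initialization the fused output equals the current-frame feature,
# regardless of what the (untrained) history branch contains.
weights = identity_init_weights(3)
fused = fuse(weights, [1.0, 2.0, 3.0], [9.0, 9.0, 9.0])
```

As training proceeds, the history block of the weights can move away from zero, which is one way to read "evolution" from an inherited spatial prior.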

📝 Abstract

3D semantic occupancy prediction is a cornerstone for embodied AI, enabling agents to perceive dense scene geometry and semantics incrementally from monocular video streams. However, current online frameworks face two critical bottlenecks: the inherent depth ambiguity of monocular estimation, which causes "feature bleeding" at object boundaries, and the "cold start" instability, in which uninitialized temporal fusion layers distort high-quality spatial priors during early training stages. In this paper, we propose SGR-OCC (Soft-Gating and Ray-refinement Occupancy), a unified framework driven by the philosophy of "Inheritance and Evolution". To fully inherit monocular spatial expertise, we introduce a Soft-Gating Feature Lifter that explicitly models depth uncertainty via a Gaussian gate to probabilistically suppress background noise. Furthermore, a Dynamic Ray-Constrained Anchor Refinement module simplifies complex 3D displacement searches into efficient 1D depth corrections along camera rays, ensuring sub-voxel adherence to physical surfaces. To ensure stable evolution toward temporal consistency, we employ a Two-Phase Progressive Training Strategy equipped with identity-initialized fusion, effectively resolving the cold start problem and shielding spatial priors from noisy early gradients. Extensive experiments on the EmbodiedOcc-ScanNet and Occ-ScanNet benchmarks demonstrate that SGR-OCC achieves state-of-the-art performance. In local prediction tasks, SGR-OCC achieves a completion IoU of 58.55% and a semantic mIoU of 49.89%, surpassing the previous best method, EmbodiedOcc++, by 3.65% and 3.69%, respectively. In challenging embodied prediction tasks, our model reaches 55.72% SC-IoU and 46.22% mIoU. Qualitative results further confirm our model's superior capability in preserving structural integrity and boundary sharpness in complex indoor environments.
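The two geometric ideas in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the gate is assumed to be an unnormalized Gaussian centered on the predicted depth with a per-pixel spread `sigma`, "depth" is taken as distance along the ray, and the function names (`gaussian_gate`, `lift_feature`, `refine_along_ray`) are hypothetical.

```python
import math


def gaussian_gate(sample_depth, pred_depth, sigma):
    """Soft weight in (0, 1]: close to 1 near the predicted depth and
    decaying for samples far in front of or behind it (assumed form)."""
    return math.exp(-0.5 * ((sample_depth - pred_depth) / sigma) ** 2)


def lift_feature(feature, sample_depths, pred_depth, sigma):
    """Scale one pixel's feature vector by the gate at each depth sample
    along its ray, instead of hard-assigning it to a single depth."""
    return [[gaussian_gate(d, pred_depth, sigma) * f for f in feature]
            for d in sample_depths]


def refine_along_ray(cam_origin, ray_dir, depth, delta):
    """Move an anchor along its camera ray by a scalar correction `delta`,
    replacing a free 3D offset with a 1D one."""
    norm = math.sqrt(sum(c * c for c in ray_dir))
    unit = [c / norm for c in ray_dir]
    new_depth = depth + delta
    return [o + new_depth * u for o, u in zip(cam_origin, unit)]


# A sample at the predicted depth keeps its feature; a sample 1 m behind
# it (with sigma = 0.1 m) is almost entirely suppressed.
on_surface = gaussian_gate(2.0, 2.0, 0.1)
behind = gaussian_gate(3.0, 2.0, 0.1)
lifted = lift_feature([1.0, 2.0], [2.0, 3.0], 2.0, 0.1)
anchor = refine_along_ray([0.0, 0.0, 0.0], [0.0, 0.0, 2.0], 1.0, 0.5)
```

Constraining the correction to the ray means the refinement predicts one scalar per anchor rather than three, and the anchor cannot drift sideways off the pixel that observed it.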

Problem

Research questions and friction points this paper is trying to address.

3D semantic occupancy prediction

monocular depth ambiguity

feature bleeding

cold start instability

embodied AI

Innovation

Methods, ideas, or system contributions that make the work stand out.

Soft-Gating Feature Lifting

Ray-Constrained Geometric Refinement

Temporal Fusion Cold Start