Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing monocular 3D occupancy prediction methods underuse geometric priors and struggle to model internal voxel structure, limiting both performance and generalization. This work proposes GPOcc, a framework that, for the first time, extends generic vision-based geometric priors inward along camera rays into Gaussian primitives within voxels, yielding a sparse, probabilistic occupancy representation. It further introduces a training-free, streaming multi-frame fusion mechanism. On Occ-ScanNet and EmbodiedOcc-ScanNet, the approach achieves substantial gains: it improves mIoU by 9.99 in the monocular setting and by 11.79 in the streaming setting. With the same depth prior, GPOcc also accelerates inference by 2.65× while boosting mIoU by 6.73.

📝 Abstract
Accurate 3D scene understanding is essential for embodied intelligence, with occupancy prediction emerging as a key task for reasoning about both objects and free space. Existing approaches largely rely on depth priors (e.g., DepthAnything) but make only limited use of 3D cues, restricting performance and generalization. Recently, visual geometry models such as VGGT have shown strong capability in providing rich 3D priors, but similar to monocular depth foundation models, they still operate at the level of visible surfaces rather than volumetric interiors, motivating us to explore how to more effectively leverage these increasingly powerful geometry priors for 3D occupancy prediction. We present GPOcc, a framework that leverages generalizable visual geometry priors (GPs) for monocular occupancy prediction. Our method extends surface points inward along camera rays to generate volumetric samples, which are represented as Gaussian primitives for probabilistic occupancy inference. To handle streaming input, we further design a training-free incremental update strategy that fuses per-frame Gaussians into a unified global representation. Experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate significant gains: GPOcc improves mIoU by +9.99 in the monocular setting and +11.79 in the streaming setting over the prior state of the art. Under the same depth prior, it achieves +6.73 mIoU while running 2.65× faster. These results highlight that GPOcc leverages geometry priors more effectively and efficiently. Code will be released at https://github.com/JuIvyy/GPOcc.
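The abstract's core sampling idea, extending back-projected surface points inward along their camera rays and attaching a Gaussian primitive to each volumetric sample, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, step sizes, and isotropic Gaussian parameterization are assumptions.

```python
import numpy as np

def backproject(depth, K):
    """Back-project a depth map to per-pixel 3D surface points (camera frame)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)  # (h, w, 3)

def extend_along_rays(points, n_samples=4, step=0.1):
    """Push each surface point further along its camera ray to sample the
    volume behind the visible surface (illustrative step count and size)."""
    pts = points.reshape(-1, 3)                               # (N, 3)
    dirs = pts / np.linalg.norm(pts, axis=1, keepdims=True)   # unit ray directions
    offsets = step * np.arange(1, n_samples + 1)              # depths beyond surface
    samples = pts[None] + offsets[:, None, None] * dirs[None] # (S, N, 3)
    return samples.reshape(-1, 3)

def to_gaussians(samples, sigma=0.05):
    """Attach an isotropic Gaussian to each volumetric sample, so occupancy
    can be queried probabilistically rather than as hard voxel labels."""
    return {"means": samples, "scales": np.full((len(samples), 3), sigma)}
```

In this reading, sparsity comes for free: Gaussians exist only along rays behind observed surfaces, not over the full dense voxel grid.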
Problem

Research questions and friction points this paper is trying to address.

occupancy prediction
visual geometry priors
3D scene understanding
monocular 3D reconstruction
volumetric representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gaussian primitives
visual geometry priors
monocular occupancy prediction
incremental fusion
volumetric representation
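The "incremental fusion" contribution listed above — the abstract's training-free strategy for fusing per-frame Gaussians into a unified global representation — could plausibly be sketched as a pose transform followed by duplicate pruning. The camera-to-world pose convention and the per-voxel pruning rule here are assumptions for illustration only.

```python
import numpy as np

def fuse_frame(global_means, frame_means, pose, voxel=0.1):
    """Transform a frame's Gaussian means into the world frame via the
    camera-to-world pose (4x4), merge them into the global set, and keep
    at most one Gaussian per voxel -- an update with no learned parameters."""
    homog = np.hstack([frame_means, np.ones((len(frame_means), 1))])
    world = (pose @ homog.T).T[:, :3]
    merged = np.vstack([global_means, world]) if len(global_means) else world
    # Keep one representative per occupied voxel to bound memory growth.
    keys = np.floor(merged / voxel).astype(np.int64)
    _, keep = np.unique(keys, axis=0, return_index=True)
    return merged[np.sort(keep)]
```

Because the update is purely geometric, it can run on streaming input without retraining, matching the summary's description of the mechanism as training-free.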