VGGT-Occ: Geometry-Grounded and Density-Aware Gated Fusion for 3D Occupancy Prediction

📅 2026-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the physical inconsistency in existing 3D semantic occupancy prediction methods, which apply camera geometry only at the initial projection and leave the rest of 2D-to-3D feature lifting geometry-agnostic. To resolve this, the authors propose VGGT-Occ, a framework that carries geometric information through the entire feature lifting and fusion pipeline. Specifically, it introduces projection-aware deformable attention (PA-DA) with a Jacobian-based additive bias to suppress unreliable observations, employs a view-quality semantic gate for cross-view fusion, and designs a coarse-to-fine gated decoder that allocates computation by information density. Evaluated on SurroundOcc-nuScenes with about 41M trainable parameters in the occupancy head, VGGT-Occ attains 33.00% IoU and 21.08% mIoU at T=1, and 33.64% IoU and 21.43% mIoU at T=2, outperforming existing methods.
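To make the coarse-to-fine gated decoder idea concrete, the following is a minimal sketch of gated fusion between a low-resolution and a higher-resolution feature grid. It is an illustration under stated assumptions, not the authors' decoder: the 1D grid, nearest-neighbour upsampling, the scalar gate weights `w_coarse`, `w_fine`, `bias`, and the function names are all hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def upsample_nearest(coarse, factor=2):
    """Repeat each coarse cell `factor` times (nearest-neighbour upsampling)."""
    return [v for v in coarse for _ in range(factor)]


def gated_fuse(coarse, fine, w_coarse=1.0, w_fine=1.0, bias=0.0, factor=2):
    """Blend upsampled coarse features with fine features through a gate.

    The gate form (a sigmoid over a weighted sum) is an assumption made
    for illustration; a learned gate would replace the fixed scalars.
    """
    up = upsample_nearest(coarse, factor)
    if len(up) != len(fine):
        raise ValueError("fine grid must be `factor` times the coarse grid")
    fused = []
    for c, f in zip(up, fine):
        g = sigmoid(w_coarse * c + w_fine * f + bias)
        # g near 1 keeps the fine feature, g near 0 keeps the coarse one.
        fused.append(g * f + (1.0 - g) * c)
    return fused


coarse = [0.2, 0.8]
fine = [0.1, 0.3, 0.9, 0.7]
fused = gated_fuse(coarse, fine)
```

Because the output is a convex combination, each fused value stays between the coarse and fine values it was built from, which is one reason gates of this kind are a common choice for resolution-to-resolution refinement.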
📝 Abstract
3D semantic occupancy prediction requires accurate 2D-to-3D feature lifting, yet current methods restrict camera geometry to initial projections. Subsequent operations like offset learning, attention weighting, and cross-camera aggregation remain geometry-agnostic, ignoring essential physical constraints. We propose VGGT-Occ, a framework that embeds geometric tokens throughout the entire pipeline. We introduce Projection-Aware Deformable Attention (PA-DA) to inject geometry into all attention stages. PA-DA projects 3D offsets back to image planes and leverages the projection Jacobian as an additive bias to suppress unreliable observations. Features are then integrated through a view-quality semantic gate for cross-view consistency. To optimize both efficiency and performance, we employ a sequential coarse-to-fine decoder with gated fusion, where low-resolution features are refined into higher resolutions, allocating computation by information density while substantially reducing decoder cost. Extensive evaluations demonstrate the effectiveness and accuracy of our approach. On SurroundOcc-nuScenes, VGGT-Occ achieves 33.00% IoU and 21.08% mIoU (T=1), and 33.64% IoU and 21.43% mIoU with T=2 inference, outperforming existing methods, with only ~41M trainable parameters in the occupancy head. Code will be released publicly.
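The PA-DA step described in the abstract (projecting 3D offsets to the image plane and using the projection Jacobian as an additive attention bias) can be sketched as follows. This is a rough illustration that assumes an ideal pinhole camera with points already in the camera frame. The abstract does not give the bias formula, so `jacobian_bias` and its `alpha` and `min_depth` parameters are hypothetical, as are all function names.

```python
import math


def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (X, Y, Z) to pixels (u, v)."""
    X, Y, Z = point
    return fx * X / Z + cx, fy * Y / Z + cy


def projection_jacobian(point, fx, fy):
    """2x3 Jacobian d(u, v) / d(X, Y, Z) of the pinhole projection."""
    X, Y, Z = point
    return [
        [fx / Z, 0.0, -fx * X / (Z * Z)],
        [0.0, fy / Z, -fy * Y / (Z * Z)],
    ]


def offset_to_pixels(jac, offset):
    """First-order image-plane shift caused by a small 3D offset."""
    return tuple(sum(j * d for j, d in zip(row, offset)) for row in jac)


def jacobian_bias(point, fx, fy, alpha=1.0, min_depth=0.1):
    """Hypothetical additive bias derived from the Jacobian magnitude.

    Assumption for illustration only: points behind or too close to the
    camera are masked out, and a more sensitive projection is penalised.
    """
    if point[2] < min_depth:
        return float("-inf")
    jac = projection_jacobian(point, fx, fy)
    frob = math.sqrt(sum(j * j for row in jac for j in row))
    return -alpha * math.log1p(frob / max(fx, fy))


def biased_softmax(logits, biases):
    """Softmax over logits plus additive biases; -inf entries get weight 0."""
    scores = [l + b for l, b in zip(logits, biases)]
    finite = [s for s in scores if s != float("-inf")]
    if not finite:
        return [0.0] * len(scores)
    m = max(finite)
    exps = [0.0 if s == float("-inf") else math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


points = [(0.0, 0.0, 2.0), (1.0, 0.5, 10.0), (0.0, 0.0, -1.0)]
biases = [jacobian_bias(p, fx=100.0, fy=100.0) for p in points]
weights = biased_softmax([0.0, 0.0, 0.0], biases)
```

Adding the bias to the logits before the softmax, rather than rescaling the weights afterwards, keeps the attention weights normalised and lets a masked sample drop to exactly zero weight.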
Problem

Research questions and friction points this paper is trying to address.

3D occupancy prediction
geometry-aware lifting
camera geometry
feature fusion
semantic consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

geometry-aware attention
3D occupancy prediction
projection-aware deformable attention
gated fusion
coarse-to-fine decoding