VG3T: Visual Geometry Grounded Gaussian Transformer

📅 2025-11-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the fragmentation and semantic inconsistency that plague coherent 3D scene reconstruction from multi-view images, this paper proposes an end-to-end multi-view joint Gaussian prediction framework. The method jointly predicts 3D Gaussian primitives across all input views, departing from conventional single-view processing pipelines. Key contributions: (1) a joint multi-view paradigm for predicting semantically attributed 3D Gaussians; (2) a grid-based sampling strategy coupled with learnable positional refinement to correct the distance-dependent density bias induced by pixel-aligned initialization; and (3) a multi-view Transformer combined with semantic occupancy prediction for unified geometric-semantic modeling. Evaluated on nuScenes, the approach achieves a 1.7-point absolute improvement in mIoU while using 46% fewer Gaussian primitives than the previous state of the art, demonstrating gains in both reconstruction fidelity and computational efficiency.

Technology Category

Application Category

📝 Abstract
Generating a coherent 3D scene representation from multi-view images is a fundamental yet challenging task. Existing methods often struggle with multi-view fusion, leading to fragmented 3D representations and sub-optimal performance. To address this, we introduce VG3T, a novel multi-view feed-forward network that predicts a 3D semantic occupancy via a 3D Gaussian representation. Unlike prior methods that infer Gaussians from single-view images, our model directly predicts a set of semantically attributed Gaussians in a joint, multi-view fashion. This novel approach overcomes the fragmentation and inconsistency inherent in view-by-view processing, offering a unified paradigm to represent both geometry and semantics. We also introduce two key components, Grid-Based Sampling and Positional Refinement, to mitigate the distance-dependent density bias common in pixel-aligned Gaussian initialization methods. Our VG3T shows a notable 1.7%p improvement in mIoU while using 46% fewer primitives than the previous state-of-the-art on the nuScenes benchmark, highlighting its superior efficiency and performance.
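The "distance-dependent density bias" the abstract attributes to pixel-aligned initialization can be made concrete with a small back-of-the-envelope sketch: one Gaussian per pixel fixes the *angular* sampling rate, so the lateral spacing between neighboring unprojected Gaussians grows linearly with depth and their surface density falls off quadratically. The focal length and depths below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Why pixel-aligned Gaussian initialization is distance-biased (hedged sketch).
# One Gaussian per pixel = fixed angular resolution, so spatial spacing grows
# linearly with depth and density falls off as 1/d^2. Numbers are illustrative.

fx = 500.0                             # focal length in pixels (assumed)
depths = np.array([5.0, 10.0, 40.0])   # sample depths in metres (assumed)

# Lateral spacing between Gaussians unprojected from adjacent pixels at depth d:
spacing = depths / fx                  # metres between neighbouring Gaussians
density = 1.0 / spacing**2             # Gaussians per m^2 on a fronto-parallel plane

for d, s, rho in zip(depths, spacing, density):
    print(f"depth {d:5.1f} m -> spacing {s:.3f} m, density {rho:10.2f} / m^2")
```

Doubling the depth halves the spacing rate and quarters the density, which is why nearby regions are oversampled and far regions starved under view-by-view pixel-aligned prediction.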
Problem

Research questions and friction points this paper is trying to address.

Generating a coherent 3D scene representation from multi-view images
Fragmentation and inconsistency in multi-view fusion for 3D representations
Distance-dependent density bias in pixel-aligned Gaussian initialization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-view feed-forward network that predicts 3D semantic occupancy via 3D Gaussians
Joint multi-view prediction of semantically attributed Gaussians
Grid-based sampling and learnable positional refinement to mitigate density bias
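The last two components can be sketched together: place candidate Gaussian centers uniformly on a 3D grid over the scene bounds (so density no longer depends on camera distance), then let a network nudge each center by a bounded offset. The random offsets, function names, and grid parameters below are stand-ins of my own, not the paper's implementation.

```python
import numpy as np

# Hedged sketch of grid-based sampling with positional refinement, assuming a
# uniform voxel grid over the scene bounds and a tanh-bounded offset of at most
# half a cell per Gaussian. Offsets here are random stand-ins for network output.

def grid_sample_centers(bounds_min, bounds_max, resolution):
    """One candidate Gaussian center at each cell center of a uniform grid."""
    axes = [np.linspace(lo + (hi - lo) / (2 * resolution),
                        hi - (hi - lo) / (2 * resolution), resolution)
            for lo, hi in zip(bounds_min, bounds_max)]
    xs, ys, zs = np.meshgrid(*axes, indexing="ij")
    return np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)

def refine_positions(centers, raw_offsets, cell_size):
    """Shift each center by a learned offset, bounded to half a cell per axis."""
    return centers + 0.5 * cell_size * np.tanh(raw_offsets)

bounds_min = np.array([-40.0, -40.0, -1.0])   # illustrative scene bounds (m)
bounds_max = np.array([40.0, 40.0, 5.0])
res = 8
centers = grid_sample_centers(bounds_min, bounds_max, res)
cell = (bounds_max - bounds_min) / res
offsets = np.random.default_rng(0).normal(size=centers.shape)  # stand-in predictions
refined = refine_positions(centers, offsets, cell)
print(centers.shape)  # one center per grid cell, uniform density at any depth
```

Bounding the offset to half a cell keeps every refined center inside its own voxel, so the refinement adds geometric flexibility without reintroducing the clustering that pixel-aligned initialization suffers from.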
Junho Kim
School of Electrical Engineering, Kookmin University, Seoul 02707, South Korea
Seongwon Lee
University of Illinois, Urbana-Champaign
Robotics · Task and Motion Planning · Artificial Intelligence · Control System