VGGT-X: When VGGT Meets Dense Novel View Synthesis

📅 2025-09-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Dense novel view synthesis (NVS) typically depends on Structure-from-Motion (SfM) pipelines such as COLMAP for camera poses and geometry, while 3D foundation models (3DFMs) applied to dense views suffer from rapidly growing VRAM use and imperfect outputs that degrade initialization-sensitive training. This paper proposes VGGT-X, an SfM-free pipeline for dense NVS built on three components: a memory-efficient VGGT implementation that scales to 1,000+ images, an adaptive global alignment step that refines VGGT outputs, and robust 3D Gaussian Splatting (3DGS) training practices. In COLMAP-free settings, VGGT-X reports state-of-the-art rendering quality and pose estimation accuracy, substantially narrowing the fidelity gap with COLMAP-initialized pipelines. The authors also analyze the causes of the gap that remains.

📝 Abstract

We study the problem of applying 3D Foundation Models (3DFMs) to dense Novel View Synthesis (NVS). Despite significant progress in Novel View Synthesis powered by NeRF and 3DGS, current approaches remain reliant on accurate 3D attributes (e.g., camera poses and point clouds) acquired from Structure-from-Motion (SfM), which is often slow and fragile in low-texture or low-overlap captures. Recent 3DFMs showcase orders of magnitude speedup over the traditional pipeline and great potential for online NVS. But most of the validation and conclusions are confined to sparse-view settings. Our study reveals that naively scaling 3DFMs to dense views encounters two fundamental barriers: dramatically increasing VRAM burden and imperfect outputs that degrade initialization-sensitive 3D training. To address these barriers, we introduce VGGT-X, incorporating a memory-efficient VGGT implementation that scales to 1,000+ images, an adaptive global alignment for VGGT output enhancement, and robust 3DGS training practices. Extensive experiments show that these measures substantially close the fidelity gap with COLMAP-initialized pipelines, achieving state-of-the-art results in dense COLMAP-free NVS and pose estimation. Additionally, we analyze the causes of remaining gaps with COLMAP-initialized rendering, providing insights for the future development of 3D foundation models and dense NVS. Our project page is available at https://dekuliutesla.github.io/vggt-x.github.io/

Problem

Research questions and friction points this paper is trying to address.

Applying 3D foundation models to dense novel view synthesis

Overcoming VRAM burden and imperfect outputs in dense views

Achieving COLMAP-free dense novel view synthesis and pose estimation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Memory-efficient VGGT scales to 1,000+ images

Adaptive global alignment enhances VGGT output quality

Robust 3DGS training practices improve dense view synthesis
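
Pose accuracy in COLMAP-free settings is commonly measured by first aligning the estimated camera centers to reference ones with a similarity transform, since the two reconstructions differ by an arbitrary scale, rotation, and translation. The sketch below shows that standard evaluation step (Umeyama alignment followed by an RMSE over camera centers). It is an illustration only: it is not the paper's adaptive global alignment, and the function names `umeyama_alignment` and `ate_rmse` are hypothetical, introduced here for the example.

```python
import numpy as np


def umeyama_alignment(src, dst):
    """Least-squares similarity transform (s, R, t) with dst ~= s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D points (e.g. camera centers).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = len(src)

    # Center both point sets.
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Cross-covariance between the centered sets.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


def ate_rmse(est_centers, ref_centers):
    """RMSE between reference camera centers and similarity-aligned estimates."""
    est = np.asarray(est_centers, dtype=float)
    ref = np.asarray(ref_centers, dtype=float)
    s, R, t = umeyama_alignment(est, ref)
    aligned = s * est @ R.T + t
    return float(np.sqrt(((aligned - ref) ** 2).sum(axis=1).mean()))
```

Given estimated centers `est` and reference centers `ref` as `(N, 3)` arrays in matching order, `ate_rmse(est, ref)` returns a single error value in the units of `ref`. Because the alignment absorbs any global scale, rotation, and translation, the remaining error reflects only the relative inconsistency between the two sets of poses.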
