HD-VGGT: High-Resolution Visual Geometry Transformer

📅 2026-03-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work tackles two obstacles in Transformer-based 3D reconstruction from high-resolution images: prohibitive computational cost and unstable geometry inference in weakly textured regions. To overcome these limitations, the authors propose a dual-branch Transformer architecture: a low-resolution branch produces globally consistent coarse geometry, while a high-resolution branch recovers fine details through a learnable feature upsampling module. A feature modulation mechanism is further introduced to suppress interference from unreliable regions. The approach enables, for the first time, efficient processing of high-resolution inputs, significantly reducing computational and memory requirements while improving robustness in challenging areas such as blurred or repetitive textures. Under multi-view high-resolution supervision, the method attains state-of-the-art reconstruction accuracy.
📝 Abstract
High-resolution imagery is essential for accurate 3D reconstruction, as many geometric details only emerge at fine spatial scales. Recent feed-forward approaches, such as the Visual Geometry Grounded Transformer (VGGT), have demonstrated the ability to infer scene geometry from large collections of images in a single forward pass. However, scaling these models to high-resolution inputs remains challenging: the number of tokens in transformer architectures grows rapidly with both image resolution and the number of views, leading to prohibitive computational and memory costs. Moreover, we observe that visually ambiguous regions, such as repetitive patterns, weak textures, or specular surfaces, often produce unstable feature tokens that degrade geometric inference, especially at higher resolutions. We introduce HD-VGGT, a dual-branch architecture for efficient and robust high-resolution 3D reconstruction. A low-resolution branch predicts a coarse, globally consistent geometry, while a high-resolution branch refines details via a learned feature upsampling module. To handle unstable tokens, we propose Feature Modulation, which suppresses unreliable features early in the transformer. HD-VGGT leverages high-resolution images and supervision without full-resolution transformer costs, achieving state-of-the-art reconstruction quality.
Problem

Research questions and friction points this paper is trying to address.

high-resolution 3D reconstruction
transformer scalability
visual ambiguity
feature instability
computational cost
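The scalability and cost problems above can be made concrete with a back-of-the-envelope token count for a ViT-style backbone. The patch size and resolutions below are illustrative assumptions, not numbers from the paper:

```python
# Token count and self-attention cost for a ViT-style backbone.
# Patch size 14 (DINOv2-style) and the resolutions are assumptions.
def attention_cost(num_views, height, width, patch=14):
    tokens = num_views * (height // patch) * (width // patch)
    return tokens, tokens ** 2  # self-attention is quadratic in token count

lo = attention_cost(num_views=32, height=518, width=518)
hi = attention_cost(num_views=32, height=1036, width=1036)
print(hi[0] // lo[0])  # doubling resolution ~quadruples the tokens
print(hi[1] // lo[1])  # ...and gives ~16x the pairwise attention work
```

This is why running a full-resolution transformer over many views is prohibitive, and why a coarse low-resolution branch plus lightweight refinement is attractive.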
Innovation

Methods, ideas, or system contributions that make the work stand out.

HD-VGGT
high-resolution 3D reconstruction
dual-branch architecture
feature modulation
visual geometry transformer
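The dual-branch architecture and feature modulation listed above can be sketched as follows. This is a hypothetical NumPy illustration of the general idea only: the gating function, the nearest-neighbour stand-in for the learned upsampling module, and all shapes are assumptions, not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feature_modulation(tokens, scores):
    """Gate each token by a reliability score in (0, 1), damping
    unstable tokens (e.g. repetitive or weak textures) early on."""
    gate = sigmoid(scores)   # (N, 1) per-token reliability
    return tokens * gate     # unreliable tokens contribute less

def upsample_features(feat, factor):
    """Stand-in for the learned upsampling module:
    nearest-neighbour repeat over the spatial grid."""
    return feat.repeat(factor, axis=0).repeat(factor, axis=1)

# Toy dual-branch pass on a single view.
rng = np.random.default_rng(0)
lo = rng.standard_normal((8, 8, 16))            # low-res branch features
hi_detail = rng.standard_normal((32, 32, 16))   # high-res branch features

coarse_up = upsample_features(lo, 4)            # coarse geometry, upsampled to (32, 32, 16)
fused = coarse_up + hi_detail                   # high-res branch refines the coarse estimate

scores = rng.standard_normal((32 * 32, 1))
refined = feature_modulation(fused.reshape(-1, 16), scores).reshape(32, 32, 16)
print(refined.shape)  # (32, 32, 16)
```

The key design point is that the expensive global reasoning happens only at low resolution, while the high-resolution path is a cheap, local refinement with unreliable tokens suppressed before attention.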
Tianrun Chen
Zhejiang University
Computer Vision, 3D Reconstruction, Computational Imaging, Large Vision-Language Model
Yuanqi Hu
KOKONI 3D, Moxin Technology
Yidong Han
Huzhou University
Hanjie Xu
KOKONI 3D, Moxin Technology
Deyi Ji
Tencent; USTC Ph.D.
Multimodal LLM, Computer Vision, NLP
Qi Zhu
University of Science and Technology of China
Chunan Yu
Nanjing University of Science and Technology
Xin Zhang
KOKONI 3D, Moxin Technology
Cheng Chen
KOKONI 3D, Moxin Technology
Chaotao Ding
KOKONI 3D, Moxin Technology
Ying Zang
Huzhou University
Xuanfu Li
Huawei
Jin Ma
Huawei
Lanyun Zhu
NTU, CityUHK, SUTD, BUAA
Multimodal Learning, Computer Vision, Resource-efficient Learning, Large Vision-Language Model