AVGGT: Rethinking Global Attention for Accelerating VGGT

📅 2025-12-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing multi-view 3D models (e.g., VGGT, π³) suffer from excessive computational overhead due to their reliance on global self-attention. Method: a training-free, two-stage sparsification framework. A systematic analysis of the alternating global-frame attention pattern shows that early global layers form no meaningful cross-view correspondences, middle layers perform cross-view alignment, and the last layers contribute only minor refinements. Guided by this, the method replaces early global layers with frame-wise attention and subsamples keys/values over patch tokens, with diagonal preservation and mean-based padding for efficient sparse computation. Results: 8–10× inference speedup on standard pose estimation and point-cloud reconstruction benchmarks while matching or slightly improving accuracy, with strong robustness in dense multi-view settings. Contribution: the first systematic account of the hierarchical roles of global attention in multi-view reasoning, establishing a lightweight, plug-and-play, training-free acceleration paradigm.

📝 Abstract
Since DUSt3R, models such as VGGT and $\pi^3$ have shown strong multi-view 3D performance, but their heavy reliance on global self-attention results in high computational cost. Existing sparse-attention variants offer partial speedups, yet lack a systematic analysis of how global attention contributes to multi-view reasoning. In this paper, we first conduct an in-depth investigation of the global attention modules in VGGT and $\pi^3$ to better understand their roles. Our analysis reveals a clear division of labor in the alternating global-frame architecture: early global layers do not form meaningful correspondences, middle layers perform cross-view alignment, and the last layers provide only minor refinements. Guided by these findings, we propose a training-free two-step acceleration scheme: (1) converting early global layers into frame attention, and (2) sparsifying global attention by subsampling K/V over patch tokens, with diagonal preservation and a mean-fill component. We instantiate this strategy on VGGT and $\pi^3$ and evaluate it across standard pose and point-map benchmarks. Our method achieves up to $8$-$10\times$ speedup in inference time while matching or slightly improving the accuracy of the original models, and remains robust even in extremely dense multi-view settings where prior sparse-attention baselines fail.
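The first step of the scheme, converting early global layers into frame attention, amounts to restricting each token's attention to tokens of its own frame. A minimal NumPy sketch of this idea (shapes, helper names, and the toy sizes are illustrative, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_attention(q, k, v):
    # All F*P tokens from all frames attend to each other: O((F*P)^2) scores.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def frame_attention(q, k, v, num_frames):
    # Attention restricted to tokens within the same frame:
    # cost drops from O((F*P)^2) to F * O(P^2).
    outs = [global_attention(qi, ki, vi)
            for qi, ki, vi in zip(np.split(q, num_frames),
                                  np.split(k, num_frames),
                                  np.split(v, num_frames))]
    return np.concatenate(outs)

# Toy example: F frames of P patch tokens each, head dimension d.
F, P, d = 4, 8, 16
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((F * P, d)) for _ in range(3))
out_global = global_attention(q, k, v)
out_frame = frame_attention(q, k, v, F)
```

By construction, each frame's output under frame attention equals global attention computed over that frame's tokens alone, which is why the swap is training-free for layers whose attention is already near block-diagonal.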
Problem

Research questions and friction points this paper is trying to address.

Reduces computational cost of global attention in multi-view 3D models
Analyzes role of global attention layers in VGGT and π³ architectures
Accelerates inference while maintaining accuracy in dense multi-view settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Convert early global layers to frame attention
Subsample global attention with K/V patch token subsampling
Achieve 8–10× speedup while maintaining or improving accuracy
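The second step, K/V subsampling, can be sketched as follows: each query attends to a strided subset of keys/values, keeps its own key (diagonal preservation), and gets one extra token holding the mean of the dropped keys/values (mean-fill). The stride, the single mean token, and the per-query diagonal handling here are illustrative assumptions; the paper's exact sparsification may differ:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def subsampled_global_attention(q, k, v, stride=4):
    # Strided K/V subset + diagonal preservation + mean-fill token.
    n, d = q.shape
    keep = np.arange(0, n, stride)
    drop = np.setdiff1d(np.arange(n), keep)
    k_mean = k[drop].mean(axis=0, keepdims=True)  # mean-fill over dropped K
    v_mean = v[drop].mean(axis=0, keepdims=True)  # mean-fill over dropped V
    out = np.empty_like(q)
    for i in range(n):
        idx = keep if i in keep else np.append(keep, i)  # keep the diagonal
        ks = np.vstack([k[idx], k_mean])
        vs = np.vstack([v[idx], v_mean])
        w = softmax(q[i] @ ks.T / np.sqrt(d))
        out[i] = w @ vs
    return out

rng = np.random.default_rng(1)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
out = subsampled_global_attention(q, k, v, stride=4)
```

The mean-fill token keeps the softmax normalization aware of the dropped tokens in aggregate, so the sparse scores are not simply renormalized over the kept subset.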
Authors
Xianbing Sun, Shanghai Jiao Tong University
Zhikai Zhu, Shanghai Jiao Tong University
Zhengyu Lou, Shanghai Jiao Tong University
Bo Yang, Ant Group
Jinyang Tang, Ant Group
Liqing Zhang, Professor @ Computer Science, Virginia Tech (Bioinformatics, data analytics, machine learning)
He Wang, Ant Group
Jianfu Zhang, Shanghai Jiao Tong University (Machine Learning, Computer Vision)