PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers

📅 2026-05-08
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the high computational cost of the Visual Geometry Transformer (VGGT), whose Alternating-Attention (AA) stack scales quadratically with the total token count and makes long clips expensive to process. The authors propose PaceVGGT, a pre-AA token pruning framework that places a lightweight Token Scorer before the first AA block of a frozen VGGT. The scorer estimates per-token importance from DINO features; it is first distilled against an attention target taken from inside the AA stack of the unpruned backbone, then refined under downstream camera, depth, and point-map losses. A per-frame keep budget fixes the sequence length seen by the backbone, an importance-adaptive merge/prune assignment preserves residual content from high-saliency frames, and a Feature-guided Restoration module reconstructs the dense spatial grid required by the prediction heads. On ScanNet-50, the method reduces inference latency by 5.1× relative to unmodified VGGT at N=300 and by 1.47× relative to LiteVGGT at N=1000, while remaining on the reconstruction quality-latency frontier.
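To make the quadratic-scaling argument concrete, the rough sketch below counts query-key pairs for one global attention layer (all frames attend jointly) and one frame-wise attention layer, before and after keeping a fraction of tokens per frame. The function names, frame count, and tokens-per-frame value are hypothetical choices for illustration, not figures from the paper, and the count ignores projections, MLPs, and any special tokens.

```python
def attention_pair_count(num_frames, tokens_per_frame):
    """Query-key pairs for one global and one frame-wise attention layer."""
    total = num_frames * tokens_per_frame
    global_pairs = total * total                      # every token attends to every token
    frame_pairs = num_frames * tokens_per_frame ** 2  # attention restricted to each frame
    return global_pairs, frame_pairs


def pruned_fraction(num_frames, tokens_per_frame, keep_ratio):
    """Fraction of the original pair count left after per-frame pruning."""
    kept = int(tokens_per_frame * keep_ratio)
    g0, f0 = attention_pair_count(num_frames, tokens_per_frame)
    g1, f1 = attention_pair_count(num_frames, kept)
    return g1 / g0, f1 / f0


# Hypothetical setting: 300 frames, 1000 patch tokens each, keep half per frame.
print(pruned_fraction(300, 1000, 0.5))  # (0.25, 0.25)
```

Under these assumptions, halving the tokens per frame cuts the pair count of both attention types to a quarter, which is why pruning before the attention stack can pay off more than proportionally.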
📝 Abstract
Visual Geometry Transformer (VGGT) is a strong feed-forward model for multiple 3D tasks, but its Alternating-Attention (AA) stack scales quadratically in the total token count, making long clips expensive. Existing token-reduction accelerators operate inside AA, leaving the patch grid that enters AA uncompressed. We introduce PaceVGGT, a pre-AA token pruning framework that prunes DINO patch tokens before the first AA block of a frozen VGGT. PaceVGGT trains a lightweight Token Scorer that estimates per-token importance from DINO features. The scorer is first distilled against an AA-internal attention target from the unpruned backbone, then refined under downstream camera, depth, and point-map losses. A per-frame keep budget fixes the backbone-visible sequence length, while an importance-adaptive merge/prune assignment preserves residual content from high-saliency frames under a fixed total merge budget. A Feature-guided Restoration module reconstructs the dense spatial grid required by the prediction heads. On ScanNet-50 and 7-Scenes, PaceVGGT remains on the reconstruction quality-latency frontier while reducing inference latency. On ScanNet-50, it reduces latency by \(5.1\times\) over unmodified VGGT at \(N=300\) and \(1.47\times\) over LiteVGGT at \(N=1000\). These results identify pre-AA pruning as a viable acceleration route for frozen VGGT-style geometry transformers.
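The per-frame keep budget described above can be illustrated with a minimal sketch: given one importance score per token, each frame retains the same number of top-scoring tokens, so the sequence length seen by the backbone is fixed at frames × keep. The helpers `keep_top_tokens` and `gather` are hypothetical names, and the scores are made-up values standing in for the Token Scorer's output. The sketch does not cover the merge/prune assignment or the restoration module, and it should not be read as the authors' implementation.

```python
def keep_top_tokens(scores, keep):
    """Return, per frame, the sorted indices of its `keep` highest-scoring tokens.

    scores: list of frames, each a list of per-token importance values.
    """
    kept = []
    for frame_scores in scores:
        if not 0 < keep <= len(frame_scores):
            raise ValueError("keep must be between 1 and the tokens per frame")
        # Rank token indices by descending score; the stable sort keeps the
        # lower index first when two scores tie.
        ranked = sorted(range(len(frame_scores)), key=lambda i: -frame_scores[i])
        # Re-sort the survivors so they stay in original grid order.
        kept.append(sorted(ranked[:keep]))
    return kept


def gather(tokens, kept):
    """Select the surviving token features frame by frame."""
    return [[frame[i] for i in idx] for frame, idx in zip(tokens, kept)]


# Two frames with four tokens each; scores are illustrative placeholders.
scores = [[0.1, 0.9, 0.4, 0.7], [0.8, 0.2, 0.6, 0.3]]
tokens = [["a0", "a1", "a2", "a3"], ["b0", "b1", "b2", "b3"]]
kept = keep_top_tokens(scores, keep=2)
print(kept)                  # [[1, 3], [0, 2]]
print(gather(tokens, kept))  # [['a1', 'a3'], ['b0', 'b2']]
```

Returning indices in grid order keeps track of where each surviving token came from, which is the kind of information a later step would need to rebuild a dense spatial grid.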
Problem

Research questions and friction points this paper is trying to address.

Visual Geometry Transformer
token pruning
Alternating-Attention
inference latency
3D reconstruction
Innovation

Methods, ideas, or system contributions that make the work stand out.

token pruning
Visual Geometry Transformer
Alternating-Attention
pre-AA acceleration
feature-guided restoration