PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational cost of the Visual Geometry Transformer (VGGT), whose Alternating-Attention (AA) stack scales quadratically with the total number of tokens and makes long video sequences expensive to process. The authors propose PaceVGGT, a pre-AA token pruning framework that places a lightweight Token Scorer before the first AA block of a frozen VGGT. The scorer estimates per-token importance from DINO features; it is first distilled against an attention target taken from the unpruned backbone and then refined under the downstream camera, depth, and point-map losses. A per-frame keep budget fixes the sequence length seen by the backbone, and an importance-adaptive merge/prune assignment decides how the remaining tokens are handled. A Feature-guided Restoration module then reconstructs the dense spatial grid that the prediction heads require. On ScanNet-50, the method reduces inference latency by 5.1× relative to unmodified VGGT at N=300 and by 1.47× relative to LiteVGGT at N=1000, while staying on the reconstruction quality-latency frontier.
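To make the quadratic-scaling argument concrete, the sketch below counts query-key pairs for one frame-wise and one global attention layer, before and after pruning every frame to a fixed keep ratio. The function names, the keep ratio, and the token counts are illustrative assumptions, not values from the paper, and pair counts are only a rough proxy for measured latency.

```python
def attention_pair_counts(num_frames, tokens_per_frame):
    """Count query-key pairs for one frame-wise and one global attention layer.

    Frame-wise attention lets tokens attend within their own frame;
    global attention lets every token attend to every token in the clip.
    """
    frame_pairs = num_frames * tokens_per_frame ** 2
    global_pairs = (num_frames * tokens_per_frame) ** 2
    return frame_pairs, global_pairs


def pruning_speedup(num_frames, tokens_per_frame, keep_ratio):
    """Ratio of dense to pruned pair counts (hypothetical helper).

    keep_ratio is an assumed fraction of tokens kept per frame.
    """
    kept = int(tokens_per_frame * keep_ratio)
    dense_frame, dense_global = attention_pair_counts(num_frames, tokens_per_frame)
    pruned_frame, pruned_global = attention_pair_counts(num_frames, kept)
    return dense_frame / pruned_frame, dense_global / pruned_global


# Illustrative numbers only: 300 frames, 1000 tokens per frame, keep half.
frame_gain, global_gain = pruning_speedup(300, 1000, 0.5)
print(frame_gain, global_gain)
```

Under these assumptions, halving the tokens per frame cuts the pair count of both attention types by a factor of four. End-to-end latency gains will differ, since the scorer, the restoration module, and the non-attention layers also cost time.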

📝 Abstract

Visual Geometry Transformer (VGGT) is a strong feed-forward model for multiple 3D tasks, but its Alternating-Attention (AA) stack scales quadratically in the total token count, making long clips expensive. Existing token-reduction accelerators operate inside AA, leaving the patch grid that enters AA uncompressed. We introduce PaceVGGT, a pre-AA token pruning framework that prunes DINO patch tokens before the first AA block of a frozen VGGT. PaceVGGT trains a lightweight Token Scorer that estimates per-token importance from DINO features. The scorer is first distilled against an AA-internal attention target from the unpruned backbone, then refined under downstream camera, depth, and point-map losses. A per-frame keep budget fixes the backbone-visible sequence length, while an importance-adaptive merge/prune assignment preserves residual content from high-saliency frames under a fixed total merge budget. A Feature-guided Restoration module reconstructs the dense spatial grid required by the prediction heads. On ScanNet-50 and 7-Scenes, PaceVGGT remains on the reconstruction quality-latency frontier while reducing inference latency. On ScanNet-50, it reduces latency by $5.1\times$ over unmodified VGGT at $N=300$ and $1.47\times$ over LiteVGGT at $N=1000$. These results identify pre-AA pruning as a viable acceleration route for frozen VGGT-style geometry transformers.
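The per-frame keep budget described above can be illustrated with a small top-k selection over importance scores. This is a minimal sketch, assuming the scores are already available as plain lists. The function name `keep_top_k_per_frame` and the example scores are hypothetical, and the merge/prune assignment and the restoration step are not modeled here.

```python
def keep_top_k_per_frame(scores, keep_budget):
    """Return, for each frame, the indices of the keep_budget highest-scoring tokens.

    scores: list of per-frame lists of importance scores (assumed input format).
    Kept indices are returned in ascending order so the surviving tokens
    preserve their original spatial ordering.
    """
    kept = []
    for frame_scores in scores:
        # Rank token indices from most to least important.
        ranked = sorted(
            range(len(frame_scores)),
            key=lambda i: frame_scores[i],
            reverse=True,
        )
        kept.append(sorted(ranked[:keep_budget]))
    return kept


# Two frames with four tokens each; keep two tokens per frame.
example_scores = [
    [0.1, 0.9, 0.5, 0.3],
    [0.8, 0.2, 0.7, 0.6],
]
print(keep_top_k_per_frame(example_scores, keep_budget=2))
```

Because every frame keeps the same number of tokens, the sequence length seen by the backbone depends only on the number of frames and the budget, which matches the fixed-length property the abstract attributes to the keep budget.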

Problem

Research questions and friction points this paper is trying to address.

Visual Geometry Transformer

token pruning

Alternating-Attention

inference latency

3D reconstruction

Innovation

Methods, ideas, or system contributions that make the work stand out.

token pruning

Visual Geometry Transformer

Alternating-Attention