Faster VGGT with Block-Sparse Global Attention

📅 2025-09-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Transformer-based multi-view reconstruction models (e.g., VGGT, π³) suffer from inference bottlenecks on large-scale image sets due to the quadratic computational complexity O(N²) of global self-attention. Method: We propose a plug-and-play block-sparse global attention mechanism that requires no retraining. Grounded in the empirical observation that cross-view matching regions concentrate attention probability mass in only a few critical blocks, we design a structured block-sparse pattern and integrate highly optimized sparse attention kernels. The method is fully compatible with existing architectures, leaving model structure and training procedures unchanged. Contribution/Results: Our approach achieves up to 4× speedup in inference time while maintaining reconstruction accuracy. Extensive evaluation on mainstream multi-view benchmarks, including ScanNet and DTU, demonstrates its effectiveness, robustness, and scalability.

📝 Abstract
Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models like VGGT and $π^3$ have achieved impressive results with simple architectures, yet they face an inherent runtime bottleneck due to the quadratic complexity of the global attention layers, which limits their scalability to large image sets. In this paper, we empirically analyze the global attention matrix of these models and observe that probability mass concentrates on a small subset of patch-patch interactions that correspond to cross-view geometric matches. Motivated by this structured attention and inspired by recent advances in large language models, we propose a replacement for the dense global attention operation based on highly optimized block-sparse kernels, yielding up to $4\times$ faster inference with comparable task performance. Our retrofit requires no retraining of the backbone, extends to both VGGT and $π^3$, and supports large image collections. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate the effectiveness of our approach.
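The core idea in the abstract, keeping only the attention blocks that carry most of the probability mass, can be sketched as follows. This is a minimal illustrative implementation, not the paper's optimized kernels: block relevance is estimated here from mean-pooled query/key block summaries (an assumption; the paper's selection criterion may differ), the top-scoring key blocks are kept per query block, and dense attention is computed only within the kept blocks. It assumes the sequence length is divisible by the block size.

```python
import torch

def block_sparse_attention(q, k, v, block=64, keep_ratio=0.25):
    """Illustrative block-sparse attention sketch (not the paper's kernels).

    Scores key blocks per query block using mean-pooled block summaries,
    keeps the top `keep_ratio` fraction of key blocks, and attends densely
    only within the kept blocks. Assumes q.shape[0] is divisible by `block`.
    """
    N, d = q.shape
    nb = N // block
    scale = d ** -0.5
    # Summarize each block by mean-pooling its tokens
    qb = q.reshape(nb, block, d).mean(1)          # (nb, d)
    kb = k.reshape(nb, block, d).mean(1)          # (nb, d)
    # Block-level affinities; keep the top-k key blocks per query block
    block_scores = qb @ kb.T * scale              # (nb, nb)
    topk = max(1, int(keep_ratio * nb))
    keep = block_scores.topk(topk, dim=-1).indices  # (nb, topk)
    out = torch.zeros_like(q)
    for i in range(nb):
        qi = q[i * block:(i + 1) * block]         # (block, d)
        # Gather only the selected key/value blocks for this query block
        ks = torch.cat([k[j * block:(j + 1) * block] for j in keep[i]])
        vs = torch.cat([v[j * block:(j + 1) * block] for j in keep[i]])
        attn = torch.softmax(qi @ ks.T * scale, dim=-1)
        out[i * block:(i + 1) * block] = attn @ vs
    return out
```

With `keep_ratio=1.0` every block is kept and the result matches dense attention; smaller ratios trade accuracy for fewer block computations, which is where the speedup in a fused sparse kernel would come from. A real deployment would replace the Python loop with block-sparse GPU kernels.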
Problem

Research questions and friction points this paper is trying to address.

Addresses quadratic complexity bottleneck in transformer-based multi-view reconstruction
Replaces dense global attention with block-sparse kernels for efficiency
Enables scalable processing of large image collections without retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Block-sparse kernels replace dense attention
No retraining required for backbone models
Supports large image collections efficiently
Chung-Shien Brian Wang
Computer Vision Group, RWTH Aachen University
Christian Schmidt
Computer Vision Group, RWTH Aachen University
Jens Piekenbrinck
Computer Vision Group, RWTH Aachen University
Bastian Leibe
Professor for Computer Vision, RWTH Aachen University
Computer Vision, Object Recognition, Tracking, Scene Understanding