🤖 AI Summary
Transformer-based multi-view reconstruction models (e.g., VGGT, π³) suffer from an inference bottleneck on large-scale image sets due to the quadratic O(N²) computational complexity of global self-attention. Method: We propose a plug-and-play block-sparse global attention mechanism that requires no retraining. Grounded in the empirical observation that cross-view matching regions concentrate attention probability mass in only a few critical blocks, we design a structured block-sparse pattern and integrate highly optimized sparse attention kernels. The method is fully compatible with existing architectures, leaving model structure and training procedures unchanged. Contribution/Results: Our approach achieves up to 4× faster inference while maintaining reconstruction accuracy. Extensive evaluation on mainstream multi-view benchmarks, including ScanNet and DTU, demonstrates its effectiveness, robustness, and scalability.
📝 Abstract
Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models such as VGGT and $π^3$ achieve impressive results with simple architectures, yet they face an inherent runtime bottleneck: the quadratic complexity of the global attention layers limits their scalability to large image sets. In this paper, we empirically analyze the global attention matrices of these models and observe that probability mass concentrates on a small subset of patch-patch interactions that correspond to cross-view geometric matches. Motivated by this structured attention and inspired by recent advances in large language models, we propose a replacement for the dense global attention operation based on highly optimized block-sparse kernels, yielding up to $4\times$ faster inference with comparable task performance. Our retrofit requires no retraining of the backbone, extends to both VGGT and $π^3$, and supports large image collections. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate the effectiveness of our approach.
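To make the core idea concrete, below is a minimal NumPy sketch of block-sparse attention: scores are computed only for a selected set of (query-block, key-block) pairs, and all other interactions are masked out before the softmax. This is illustrative only; it is not the paper's method or its optimized kernels (which fuse the sparsity into the GPU kernel rather than materializing a dense score matrix), and the block-selection rule here is supplied explicitly rather than derived from cross-view matches as in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_sparse_attention(Q, K, V, block_size, keep_blocks):
    """Attention restricted to the given (query-block, key-block) pairs.

    keep_blocks must cover every query block at least once (e.g. include
    all diagonal blocks), otherwise a fully masked row yields NaNs.
    """
    N, d = Q.shape
    scores = np.full((N, N), -np.inf)  # masked-out pairs stay at -inf
    for qb, kb in keep_blocks:
        qs = slice(qb * block_size, (qb + 1) * block_size)
        ks = slice(kb * block_size, (kb + 1) * block_size)
        scores[qs, ks] = Q[qs] @ K[ks].T / np.sqrt(d)
    return softmax(scores, axis=-1) @ V

# Toy usage: 8 tokens, block size 2, keep only the diagonal blocks
# plus one off-diagonal block (a stand-in for a cross-view match).
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
keep = [(i, i) for i in range(4)] + [(0, 3)]
out = block_sparse_attention(Q, K, V, block_size=2, keep_blocks=keep)
```

If `keep_blocks` lists every pair, the result matches dense attention exactly; the speedup in practice comes from never computing the dropped blocks at all inside a fused sparse kernel.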