🤖 AI Summary
This work addresses the prohibitive quadratic computational complexity of dense attention in existing feed-forward 3D reconstruction models, which severely limits inference efficiency. To overcome this, we propose Speed3R, the first end-to-end feed-forward framework that integrates sparse keypoint reasoning into 3D reconstruction. Speed3R employs a dual-branch attention mechanism: a compression branch generates a coarse contextual prior that guides a selection branch, which performs fine-grained attention computation only on the most informative image tokens. Built on both VGGT and π³ backbones, Speed3R achieves a 12.4× inference speedup on standard benchmarks, substantially reducing computational overhead while preserving high-quality geometric reconstruction with only a minimal, controlled loss in accuracy.
📝 Abstract
While recent feed-forward models accelerate 3D reconstruction by jointly inferring dense geometry and camera poses in a single pass, their reliance on dense attention imposes quadratic complexity, creating a prohibitive computational bottleneck that severely limits inference speed. To resolve this, we introduce Speed3R, an end-to-end trainable model inspired by the core principle of Structure-from-Motion: a sparse set of keypoints is sufficient for robust pose estimation. Speed3R features a dual-branch attention mechanism in which a compression branch creates a coarse contextual prior to guide a selection branch, which performs fine-grained attention only on the most informative image tokens. This strategy mimics the efficiency of traditional keypoint matching, achieving a 12.4x inference speedup on 1000-view sequences while introducing a minimal, controlled trade-off in geometric accuracy. Validated on standard benchmarks with both VGGT and $\pi^3$ backbones, our method delivers high-quality reconstructions at a fraction of the computational cost, paving the way for efficient large-scale scene modeling.
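The dual-branch idea can be illustrated with a minimal NumPy sketch. This is a hypothetical toy, not Speed3R's actual implementation: the function name, pooling scheme, and parameters (`pool`, `top_k`) are assumptions. A compression branch pools tokens into coarse blocks to produce an importance prior, and a selection branch then runs exact attention only over the top-ranked tokens, reducing the per-query cost from O(n) to O(top_k).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_branch_attention(q, k, v, pool=8, top_k=64):
    """Toy sketch of a dual-branch (compression + selection) attention.

    Compression branch: average-pool keys into coarse blocks and use the
    pooled attention scores as an importance prior over token blocks.
    Selection branch: run exact attention only on the top_k tokens ranked
    by that prior, instead of all n tokens.
    All names and hyperparameters here are illustrative assumptions.
    """
    n, d = k.shape
    # --- compression branch: coarse contextual prior from pooled keys ---
    n_blocks = n // pool
    k_coarse = k[: n_blocks * pool].reshape(n_blocks, pool, d).mean(axis=1)
    coarse_scores = softmax(q @ k_coarse.T / np.sqrt(d))      # (m, n_blocks)
    # each token inherits the (query-averaged) importance of its block
    token_importance = np.repeat(coarse_scores.mean(axis=0), pool)
    # --- selection branch: fine-grained attention on selected tokens ---
    idx = np.argsort(token_importance)[-top_k:]
    attn = softmax(q @ k[idx].T / np.sqrt(d))                 # (m, top_k)
    return attn @ v[idx]

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 32))    # 16 query tokens, dim 32
k = rng.standard_normal((512, 32))   # 512 image tokens
v = rng.standard_normal((512, 32))
out = dual_branch_attention(q, k, v)
print(out.shape)  # (16, 32)
```

The key design point mirrored here is that the expensive fine-grained attention touches only `top_k` of the `n` tokens, with the cheap pooled pass deciding which ones, which is the source of the sub-quadratic scaling the paper targets.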