🤖 AI Summary
Existing feed-forward 3D reconstruction models struggle to process long videos efficiently because global attention has quadratic complexity, and current token compression methods ignore the functional difference between query tokens and key-value tokens. This work proposes a training-free, plug-and-play acceleration framework, Spark3R, built on the observation that the two token types differ in compressibility: query tokens are highly sensitive to compression, whereas key-value tokens can be substantially pruned. Leveraging this insight, the method applies a heterogeneous compression strategy (intra-group merging for query tokens, lightweight pruning for key-value tokens) and adapts the key-value compression ratio across layers. The approach achieves up to 28× speedup on 1,000-frame inputs, is compatible with models such as VGGT, π³, and Depth-Anything-3, and maintains competitive reconstruction quality.
📝 Abstract
Feed-forward 3D reconstruction models based on Vision Transformers can directly estimate scene geometry and camera poses from a small set of input images, but scaling them to video inputs with hundreds or thousands of frames remains challenging due to the quadratic cost of global attention layers. Recent token-merging methods accelerate these models by compressing the token sequence within the global attention layers, but they apply a uniform reduction to query tokens and key-value tokens, ignoring their functionally distinct roles in 3D reconstruction. In this work, we identify a key property of feed-forward 3D reconstruction models: query tokens encode view-specific geometric requests and are sensitive to compression, while key-value tokens represent shared scene context and tolerate aggressive compression. Guided by this insight, we propose Spark3R, a training-free acceleration framework that decouples the compression of query tokens and key-value tokens by assigning distinct reduction factors, with intra-group token merging applied to query tokens and lightweight token pruning to key-value tokens. Additionally, Spark3R adaptively adjusts the key-value reduction factor across layers, further improving the quality-efficiency trade-off. As a plug-and-play framework requiring no retraining, Spark3R integrates directly into multiple pretrained feed-forward 3D reconstruction models, including VGGT, $\pi^3$, and Depth-Anything-3, and achieves up to $28\times$ speedup on 1,000-frame inputs while maintaining competitive reconstruction quality.
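The decoupled compression described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The abstract does not specify how query groups are formed or which key-value tokens are pruned, so the sketch assumes groups of consecutive tokens averaged together and a simple strided pruning rule. All function names and the reduction factors `r_q` and `r_kv` are hypothetical, and the layer-wise adaptation of the key-value factor is omitted.

```python
import math
import random


def merge_queries(queries, r_q):
    """Average each run of r_q consecutive query tokens (assumed grouping rule)."""
    merged, group_sizes = [], []
    for start in range(0, len(queries), r_q):
        group = queries[start:start + r_q]
        dim = len(group[0])
        merged.append([sum(tok[d] for tok in group) / len(group) for d in range(dim)])
        group_sizes.append(len(group))
    return merged, group_sizes


def prune_key_values(keys, values, r_kv):
    """Keep every r_kv-th key/value pair (strided pruning, an assumed criterion)."""
    return keys[::r_kv], values[::r_kv]


def attention(queries, keys, values):
    """Plain single-head scaled dot-product attention on lists of vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(out_dim)]
        )
    return outputs


def unmerge(outputs, group_sizes):
    """Copy each merged output back to every token position in its group."""
    restored = []
    for row, size in zip(outputs, group_sizes):
        restored.extend(list(row) for _ in range(size))
    return restored


def compressed_attention(queries, keys, values, r_q, r_kv):
    """Attention with separate reduction factors for queries and key-values.

    Returns the full-length output and the number of attention scores computed.
    """
    merged_q, group_sizes = merge_queries(queries, r_q)
    kept_k, kept_v = prune_key_values(keys, values, r_kv)
    merged_out = attention(merged_q, kept_k, kept_v)
    return unmerge(merged_out, group_sizes), len(merged_q) * len(kept_k)


# Usage: 24 random tokens, mild query merging and stronger key-value pruning.
random.seed(0)
n_tokens, dim = 24, 4
tokens = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]
out, n_scores = compressed_attention(tokens, tokens, tokens, r_q=2, r_kv=4)
full_scores = n_tokens * n_tokens
```

With these illustrative factors the score matrix shrinks from 24 × 24 entries to 12 × 6, that is, by a factor of `r_q * r_kv`. The factors are independent, so the key-value side can be compressed more aggressively than the query side, which is the asymmetry the abstract argues for. The 28× figure reported above is an end-to-end measurement and cannot be derived from this toy count.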