🤖 AI Summary
Visual geometric Transformers suffer from slow inference, hindering their deployment in real-time 3D perception and reconstruction tasks. To address this, we propose a training-free acceleration method: a lightweight confidence predictor ranks tokens by uncertainty and merges low-confidence ones—replacing conventional similarity-driven token merging—thereby significantly reducing computational overhead while preserving spatial coverage and model performance. The method is compatible with existing Transformer architectures and requires no architectural modification or retraining. Applied to VGGT and MapAnything, it achieves up to 11.3× and 7.2× inference speedup, respectively, substantially enhancing the practicality of visual geometric Transformers for multi-view understanding and streaming vision applications.
📝 Abstract
We propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers that requires no retraining or finetuning of the base model. Co-Me distills a lightweight confidence predictor to rank tokens by uncertainty and selectively merge low-confidence ones, effectively reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and MapAnything, Co-Me achieves up to $11.3\times$ and $7.2\times$ speedup, making visual geometric transformers practical for real-time 3D perception and reconstruction.
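The core idea—rank tokens by a predicted confidence score, keep the high-confidence ones, and fold the rest into the kept set—can be sketched as below. This is a toy illustration under stated assumptions, not the paper's implementation: the predictor itself is not shown (scores are taken as given), and the choices of a fixed keep ratio, nearest-kept-token assignment by dot-product similarity, and mean-pooling as the merge operation are all assumptions for the sketch.

```python
import numpy as np

def confidence_guided_merge(tokens, confidence, keep_ratio=0.5):
    """Toy sketch of confidence-guided token merging (not the official Co-Me code).

    tokens:     (N, D) array of token embeddings
    confidence: (N,)   per-token confidence scores from a (distilled) predictor

    Keeps the top keep_ratio fraction of tokens by confidence and merges each
    low-confidence token into its most similar kept token via mean pooling,
    so downstream transformer blocks see a shorter sequence.
    """
    n, _ = tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    order = np.argsort(-confidence)             # highest confidence first
    keep_idx, merge_idx = order[:n_keep], order[n_keep:]

    merged = tokens[keep_idx].copy()
    counts = np.ones(n_keep)
    if merge_idx.size:
        # Assumption: assign each low-confidence token to its most similar
        # kept token (dot-product similarity), then average them together.
        sim = tokens[merge_idx] @ merged.T      # (n_merge, n_keep)
        assign = sim.argmax(axis=1)
        np.add.at(merged, assign, tokens[merge_idx])
        np.add.at(counts, assign, 1.0)
    merged /= counts[:, None]
    return merged, keep_idx                     # shorter sequence + kept positions
```

Because selection is driven by the confidence score rather than pairwise token similarity, spatially distinct but low-texture regions are not collapsed together unless the predictor deems them unimportant—this is the behavioral difference from similarity-based merging that the abstract emphasizes.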