🤖 AI Summary
Visual geometric Transformers suffer from slow inference, hindering their deployment in real-time 3D perception and reconstruction tasks. To address this, we propose a training-free acceleration method: a lightweight confidence predictor ranks tokens by uncertainty and merges low-confidence ones—replacing conventional similarity-driven token merging—thereby significantly reducing computational overhead while preserving spatial coverage and model performance. The method is compatible with existing Transformer architectures and requires no architectural modification or retraining. Applied to VGGT and MapAnything, it achieves up to 11.3× and 7.2× inference speedup, respectively, substantially enhancing the practicality of visual geometric Transformers for multi-view understanding and streaming vision applications.
📝 Abstract
We propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers that requires no retraining or finetuning of the base model. Co-Me distills a lightweight confidence predictor to rank tokens by uncertainty and selectively merge low-confidence ones, effectively reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and MapAnything, Co-Me achieves up to $11.3\times$ and $7.2\times$ speedup, making visual geometric transformers practical for real-time 3D perception and reconstruction.
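The core idea—rank tokens by a predicted confidence score, keep the high-confidence ones, and fold the rest into the kept set—can be sketched as below. This is a toy illustration under stated assumptions, not the paper's implementation: the predictor itself is not shown (scores are taken as given), and the choices of a fixed keep ratio, nearest-kept-token assignment by dot-product similarity, and mean-pooling as the merge operation are all assumptions for the sketch.

```python
import numpy as np

def confidence_guided_merge(tokens, confidence, keep_ratio=0.5):
    """Toy sketch of confidence-guided token merging (not the official Co-Me code).

    tokens:     (N, D) array of token embeddings
    confidence: (N,)   per-token confidence scores from a (distilled) predictor

    Keeps the top keep_ratio fraction of tokens by confidence and merges each
    low-confidence token into its most similar kept token via mean pooling,
    so downstream transformer blocks see a shorter sequence.
    """
    n, _ = tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    order = np.argsort(-confidence)             # highest confidence first
    keep_idx, merge_idx = order[:n_keep], order[n_keep:]

    merged = tokens[keep_idx].copy()
    counts = np.ones(n_keep)
    if merge_idx.size:
        # Assumption: assign each low-confidence token to its most similar
        # kept token (dot-product similarity), then average them together.
        sim = tokens[merge_idx] @ merged.T      # (n_merge, n_keep)
        assign = sim.argmax(axis=1)
        np.add.at(merged, assign, tokens[merge_idx])
        np.add.at(counts, assign, 1.0)
    merged /= counts[:, None]
    return merged, keep_idx                     # shorter sequence + kept positions
```

Because selection is driven by the confidence score rather than pairwise token similarity, spatially distinct but low-texture regions are not collapsed together unless the predictor deems them unimportant—this is the behavioral difference from similarity-based merging that the abstract emphasizes.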