🤖 AI Summary
Streaming vision Transformers (e.g., StreamVGGT) suffer from unbounded growth of key-value (KV) memory, which limits their scalability for 3D perception. To address this, we propose a training-free, inference-time dynamic token pruning mechanism: tokens are scored in real time using attention entropy and positional-importance heuristics to quantify informativeness, enabling on-the-fly removal of redundant tokens and enforcing a bounded KV memory footprint. To our knowledge, this is the first method to achieve training-free dynamic compression for streaming vision Transformers. It substantially improves memory efficiency, reducing peak memory by over 50% with negligible accuracy degradation, and under stringent memory constraints it even surpasses the original model's reconstruction accuracy. Extensive experiments across multiple 3D vision tasks demonstrate its effectiveness, enabling higher frame-rate sampling and longer sequence modeling, and thereby making streaming inference substantially more practical.
📝 Abstract
Streaming vision Transformers like StreamVGGT achieve strong 3D perception but suffer from unbounded growth of key-value (KV) memory, which limits scalability. We propose a training-free, inference-time token eviction policy that bounds memory by discarding redundant tokens while keeping the most informative ones. Our method uses significantly less memory with little to no drop in accuracy: on 7-Scenes with long sequences, it reduces peak memory from 18.63 GB to 9.39 GB while accuracy and completeness drop by only 0.003. Under strict memory budgets, eviction enables denser frame sampling, which improves reconstruction accuracy over the baseline. Experiments across video depth estimation (Sintel, KITTI), 3D reconstruction (7-Scenes, NRGBD), and camera pose estimation (Sintel, TUM-dynamics) show that our approach closely matches StreamVGGT at a fraction of the memory and makes long-horizon streaming inference more practical.
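To make the eviction idea concrete, here is a minimal sketch of a budget-bounded KV-cache eviction step. This is an illustrative stand-in, not the paper's exact policy: the function names and the specific scoring rule (attention mass received by each cached token, plus a simple recency bonus standing in for the positional-importance heuristic) are assumptions for demonstration.

```python
import numpy as np

def select_tokens_to_keep(attn, budget, recency_weight=0.1):
    """Return indices of the `budget` cached tokens to keep.

    attn: array of shape (heads, queries, n_cached) holding attention
          weights over the cached KV tokens for the current step.
    Each cached token is scored by the average attention it receives
    (a proxy for informativeness) plus a recency bonus (a hypothetical
    positional-importance heuristic); the lowest-scoring tokens are
    evicted so the cache never exceeds `budget` entries.
    """
    n_cached = attn.shape[-1]
    if n_cached <= budget:
        return np.arange(n_cached)
    # Average attention mass each cached token receives across heads/queries.
    received = attn.mean(axis=(0, 1))
    # Newer tokens get a small bonus so the freshest context survives.
    recency = np.arange(n_cached) / max(n_cached - 1, 1)
    score = received + recency_weight * recency
    # Keep the top-`budget` tokens, preserving their original order.
    return np.sort(np.argsort(score)[-budget:])

# Toy example: 2 heads, 3 queries, 8 cached tokens; token 0 dominates.
attn = np.full((2, 3, 8), 0.05)
attn[:, :, 0] = 0.65
keep = select_tokens_to_keep(attn, budget=3)
# Token 0 (heavily attended) and the newest tokens survive eviction.
```

In a real streaming loop this selection would be applied after each frame's attention pass, gathering only the kept rows of the key and value caches before the next frame arrives, which is what bounds peak memory.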