🤖 AI Summary
This work addresses the redundancy introduced by excessive visual tokens in vision-language models, which significantly increases inference overhead. The authors formulate token pruning as a bandwidth-constrained information transmission problem and propose an efficient, annotation-free pruning method that requires no auxiliary objectives. By employing lightweight Scorer and Denoiser modules, the approach learns to predict token importance using only the standard language modeling loss. A variance-preserving noisy gating mechanism ensures full gradient flow during training while enabling hard top-K selection at inference time. The method is architecture-agnostic and achieves strong transferability, retaining 96.5% of the original accuracy across ten vision-language benchmarks, accelerating LLM prefilling by 2.85×, and adding merely 0.69 milliseconds of latency.
📝 Abstract
Visual tokens dominate inference cost in vision-language models (VLMs), yet many carry redundant information. Existing pruning methods alleviate this but typically rely on attention magnitude or similarity scores. We reformulate visual token pruning as capacity-constrained communication: given a fixed budget K, the model must allocate limited bandwidth to maximally preserve visual information. We propose AutoSelect, which attaches a lightweight Scorer and Denoiser to a frozen VLM and trains with only the standard next-token prediction loss, without auxiliary objectives or extra annotations. During training, a variance-preserving noise gate modulates each token's information flow according to its predicted importance so that gradients propagate through all tokens; a diagonal-attention Denoiser then recovers the perturbed representations. At inference, only the Scorer and a hard top-K selection remain, adding negligible latency. On ten VLM benchmarks, AutoSelect retains 96.5% of full-model accuracy while accelerating LLM prefill by 2.85× with only 0.69 ms overhead, and transfers to different VLM backbones without architecture-specific tuning. Code is available at https://github.com/MedHK23/AutoSelect.
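The train-time noisy gating versus inference-time hard top-K distinction can be sketched in a few lines. This is an illustrative approximation, not the paper's implementation: the exact gate formula, the Scorer architecture, and the Denoiser are unspecified in the abstract, so the variance-preserving mix below (signal weighted by importance, remainder replaced with matched-scale Gaussian noise) is an assumed formulation.

```python
import torch

def noisy_gate(tokens, scores, training=True, k=64):
    """tokens: (B, N, D) visual tokens; scores: (B, N) importance in [0, 1]."""
    if training:
        # Assumed variance-preserving mix: keep each token's signal in
        # proportion to its predicted importance and fill the remainder
        # with Gaussian noise scaled to the tokens' overall std, so the
        # output variance stays roughly constant and gradients reach
        # every token through the differentiable scores.
        p = scores.unsqueeze(-1)                                   # (B, N, 1)
        noise = torch.randn_like(tokens) * tokens.std()
        return torch.sqrt(p) * tokens + torch.sqrt(1.0 - p) * noise
    # Inference: hard top-K selection only -- no noise, no Denoiser,
    # which is why the added latency is negligible.
    idx = scores.topk(k, dim=1).indices                            # (B, k)
    return torch.gather(tokens, 1,
                        idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
```

Because the soft gate touches all N tokens during training, the Scorer receives gradient signal for every token from the language-modeling loss alone; at inference the same scores simply rank tokens for the hard budget K.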