🤖 AI Summary
Token redundancy in multimodal large language models (MLLMs) severely hinders inference efficiency. Method: This paper proposes a reinforcement learning (RL)-driven post-training method that, for the first time, formulates token sparsity as a learnable RL optimization objective, jointly maximizing inference acceleration (via token reduction) and answer accuracy. Built on the PPO framework, it employs a dual-reward mechanism (token compression rate + accuracy) without modifying the model architecture or introducing auxiliary parameters, ensuring compatibility with mainstream MLLMs such as Qwen2-VL and Qwen2.5-VL. Contribution/Results: Evaluated across 13 image and video benchmarks, the method reduces the average token utilization of Qwen2/2.5-VL from 80% to 25%, achieving substantial inference speedup with only marginal accuracy degradation (<1.2%). The approach is efficient, parameter-light, and plug-and-play.
📝 Abstract
Sparse attention mechanisms aim to reduce computational overhead by selectively processing a subset of salient tokens while preserving model performance. Despite the effectiveness of such designs, how to actively encourage the token sparsity of well-posed MLLMs remains under-explored, which fundamentally limits the achievable acceleration during inference. In this paper, we propose a simple RL-based post-training method named **ZipR1** that treats the token reduction ratio as the efficiency reward and answer accuracy as the performance reward. In this way, our method can jointly alleviate the computation and memory bottlenecks by directly optimizing the inference-consistent efficiency-performance trade-off. Experimental results demonstrate that ZipR1 can reduce the token ratio of Qwen2/2.5-VL from 80% to 25% with minimal accuracy loss on 13 image and video benchmarks.
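The dual-reward idea described above can be sketched as follows. This is a minimal illustration of combining an efficiency reward (fraction of tokens dropped) with a performance reward (answer correctness); the function and parameter names (`dual_reward`, `alpha`, `beta`) are illustrative assumptions, not the paper's actual implementation.

```python
def dual_reward(kept_tokens: int, total_tokens: int, answer_correct: bool,
                alpha: float = 1.0, beta: float = 1.0) -> float:
    """Hypothetical sketch of a ZipR1-style dual reward.

    efficiency  -- rewards dropping tokens (higher when fewer tokens are kept)
    performance -- rewards answering correctly
    alpha, beta -- illustrative weights balancing the two objectives
    """
    efficiency = 1.0 - kept_tokens / total_tokens
    performance = 1.0 if answer_correct else 0.0
    return alpha * efficiency + beta * performance

# Example: keeping 25% of tokens while still answering correctly
# yields a higher reward than keeping 80% of tokens.
r_sparse = dual_reward(kept_tokens=25, total_tokens=100, answer_correct=True)
r_dense = dual_reward(kept_tokens=80, total_tokens=100, answer_correct=True)
```

In an actual PPO loop, a scalar reward of this shape would be assigned to each rollout, so the policy is pushed toward keeping fewer tokens only insofar as accuracy is preserved.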