ZipR1: Reinforcing Token Sparsity in MLLMs

📅 2025-04-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Token redundancy in multimodal large language models (MLLMs) severely hinders inference efficiency. Method: This paper proposes a reinforcement learning (RL)-driven post-training method that, for the first time, formulates token sparsity as a learnable RL optimization objective—jointly maximizing inference acceleration (via token reduction) and answer accuracy. Built upon the PPO framework, it employs a dual-reward mechanism (token compression rate + accuracy) without modifying model architecture or introducing auxiliary parameters, ensuring compatibility with mainstream MLLMs such as Qwen2-VL and Qwen2.5-VL. Contribution/Results: Evaluated across 13 image and video benchmarks, the method reduces the average token utilization of Qwen2/2.5-VL from 80% to 25%, achieving substantial inference speedup with only marginal accuracy degradation (<1.2%). The approach is efficient, parameter-light, and plug-and-play.
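The dual-reward mechanism described above can be sketched as a single scalar reward combining an efficiency term (token reduction ratio) and a performance term (answer accuracy). This is a minimal illustrative sketch, assuming a simple additive combination with a weighting coefficient; the function name, signature, and `alpha` weight are hypothetical, not taken from the paper's released code.

```python
def zipr1_reward(kept_tokens: int, total_tokens: int,
                 answer_correct: bool, alpha: float = 1.0) -> float:
    """Dual-reward sketch: performance reward (answer accuracy) plus a
    weighted efficiency reward (fraction of tokens pruned).

    Note: the additive form and `alpha` are assumptions for illustration;
    the paper only states that token compression rate and accuracy are
    jointly rewarded under PPO.
    """
    # Efficiency reward grows as fewer tokens are kept (more compression).
    efficiency_reward = 1.0 - kept_tokens / total_tokens
    # Performance reward: 1 for a correct answer, 0 otherwise.
    performance_reward = 1.0 if answer_correct else 0.0
    return performance_reward + alpha * efficiency_reward
```

Under this sketch, reducing token utilization from 80% to 25% (as reported for Qwen2/2.5-VL) raises the efficiency term from 0.2 to 0.75, so the policy is pushed toward aggressive pruning as long as answers stay correct.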

📝 Abstract
Sparse attention mechanisms aim to reduce computational overhead by selectively processing a subset of salient tokens while preserving model performance. Despite the effectiveness of such designs, how to actively encourage token sparsity of well-posed MLLMs remains under-explored, which fundamentally limits the achievable acceleration effect during inference. In this paper, we propose a simple RL-based post-training method named ZipR1 that treats the token reduction ratio as the efficiency reward and answer accuracy as the performance reward. In this way, our method can jointly alleviate the computation and memory bottlenecks via directly optimizing the inference-consistent efficiency-performance tradeoff. Experimental results demonstrate that ZipR1 can reduce the token ratio of Qwen2/2.5-VL from 80% to 25% with a minimal accuracy reduction on 13 image and video benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Optimizing token sparsity in MLLMs for computational efficiency
Balancing token reduction and model accuracy in inference
Reducing computation and memory bottlenecks in MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

RL-based post-training for token sparsity
Optimizes efficiency-performance tradeoff directly
Reduces token ratio significantly with minimal accuracy loss