🤖 AI Summary
Token redundancy in multimodal large language models (MLLMs) severely hinders inference efficiency. Method: This paper proposes a reinforcement learning (RL)-driven post-training method that, for the first time, formulates token sparsity as a learnable RL optimization objective, jointly maximizing inference acceleration (via token reduction) and answer accuracy. Built on the PPO framework, it employs a dual-reward mechanism (token compression rate + accuracy) without modifying the model architecture or introducing auxiliary parameters, ensuring compatibility with mainstream MLLMs such as Qwen2-VL and Qwen2.5-VL. Contribution/Results: Evaluated across 13 image and video benchmarks, the method reduces the average token utilization of Qwen2/2.5-VL from 80% to 25%, achieving substantial inference speedup with only marginal accuracy degradation (<1.2%). The approach is efficient, parameter-light, and plug-and-play.
📝 Abstract
Sparse attention mechanisms aim to reduce computational overhead by selectively processing a subset of salient tokens while preserving model performance. Despite the effectiveness of such designs, how to actively encourage the token sparsity of well-posed MLLMs remains under-explored, which fundamentally limits the achievable acceleration during inference. In this paper, we propose a simple RL-based post-training method named **ZipR1** that treats the token reduction ratio as the efficiency reward and answer accuracy as the performance reward. In this way, our method can jointly alleviate the computation and memory bottlenecks by directly optimizing the inference-consistent efficiency-performance trade-off. Experimental results demonstrate that ZipR1 can reduce the token ratio of Qwen2/2.5-VL from 80% to 25% with minimal accuracy loss on 13 image and video benchmarks.
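The dual-reward idea described above can be sketched as follows. This is a minimal illustration of combining an efficiency reward (fraction of tokens dropped) with a performance reward (answer correctness); the function and parameter names (`dual_reward`, `alpha`, `beta`) are illustrative assumptions, not the paper's actual implementation.

```python
def dual_reward(kept_tokens: int, total_tokens: int, answer_correct: bool,
                alpha: float = 1.0, beta: float = 1.0) -> float:
    """Hypothetical sketch of a ZipR1-style dual reward.

    efficiency  -- rewards dropping tokens (higher when fewer tokens are kept)
    performance -- rewards answering correctly
    alpha, beta -- illustrative weights balancing the two objectives
    """
    efficiency = 1.0 - kept_tokens / total_tokens
    performance = 1.0 if answer_correct else 0.0
    return alpha * efficiency + beta * performance

# Example: keeping 25% of tokens while still answering correctly
# yields a higher reward than keeping 80% of tokens.
r_sparse = dual_reward(kept_tokens=25, total_tokens=100, answer_correct=True)
r_dense = dual_reward(kept_tokens=80, total_tokens=100, answer_correct=True)
```

In an actual PPO loop, a scalar reward of this shape would be assigned to each rollout, so the policy is pushed toward keeping fewer tokens only insofar as accuracy is preserved.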