GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models

2026-05-13

AI Summary
This work addresses the high computational cost that vision-language models incur from processing excessive visual tokens. Existing training-aware pruning methods rely on continuous-gradient relaxations, which often converge to suboptimal solutions under aggressive compression. To overcome this limitation, the authors formulate token pruning as a Markov decision process and propose GRIP-VLM, a reinforcement learning-based framework for group-relative importance pruning. After a supervised warm-up, a lightweight agent trained with Group Relative Policy Optimization (GRPO) explores the discrete pruning space directly, eliminating the need for continuous relaxation. A budget-aware scorer lets the agent adapt to different compression ratios without retraining. Across multimodal benchmarks, GRIP-VLM consistently outperforms heuristic and supervised baselines, achieving a superior Pareto frontier and up to a 15% inference speedup at equal accuracy.
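To make the idea of pruning to a budget concrete, here is a minimal sketch in Python. It assumes per-token importance scores are already available from some scorer and keeps the highest-scoring tokens for a given keep ratio. The function name `prune_tokens`, the ceiling rounding, and the plain top-k rule are illustrative assumptions, not the authors' implementation; in particular, the summary describes a scorer that is itself budget-aware, which this sketch does not model.

```python
import math


def prune_tokens(scores, keep_ratio):
    """Return the indices of the tokens kept under a budget (hypothetical helper).

    scores:     per-token importance scores, one float per visual token
    keep_ratio: fraction of tokens to keep, in (0, 1]
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Assumption: round the budget up and always keep at least one token.
    k = max(1, math.ceil(keep_ratio * len(scores)))
    # Rank by score, highest first; ties go to the earlier token.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Restore the original token order for the kept subset.
    return sorted(ranked[:k])


scores = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8]
kept_half = prune_tokens(scores, 0.5)      # keeps 3 of 6 tokens
kept_quarter = prune_tokens(scores, 0.25)  # keeps 2 of 6 tokens
```

The same scores serve both budgets here, which loosely mirrors the claim that the compression ratio can change without retraining.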
๐Ÿ“ Abstract
In Vision-Language Models (VLMs), processing a massive number of visual tokens incurs prohibitive computational overhead. While recent training-aware pruning methods attempt to selectively discard redundant tokens, they largely rely on continuous-gradient relaxations. However, visual token pruning is inherently a discrete, non-convex combinatorial problem; consequently, these continuous approximations frequently trap the optimization in sub-optimal local minima, especially under aggressive compression budgets. To overcome this fundamental bottleneck, we propose GRIP-VLM, a Group-Relative Importance Pruning framework driven by Reinforcement Learning. Rather than relying on smooth-gradient assumptions, GRIP-VLM formulates pruning as a Markov Decision Process, employing a Group Relative Policy Optimization (GRPO) paradigm anchored by supervised warm-up to directly explore the discrete selection space. Integrated with a budget-aware scorer, our lightweight agent dynamically evaluates per-token importance and adapts to arbitrary compression ratios without retraining. Extensive experiments across diverse multimodal benchmarks demonstrate that GRIP-VLM consistently outperforms heuristic and supervised-learning baselines, achieving a superior Pareto frontier and delivering up to a 15% inference speedup at equal accuracy.
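The "group-relative" part of GRPO can be illustrated with a short sketch. In the commonly described form of GRPO, several candidate actions (here, pruning masks) are sampled for the same input, each receives a scalar reward, and each reward is normalized against the mean and standard deviation of its own group instead of a learned value baseline. The reward values below are made up, and the use of the population standard deviation and the `eps` constant are assumptions; the abstract does not specify the reward design or these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: one scalar reward per sampled pruning mask for the same input
    eps:     small constant to avoid division by zero (assumed value)
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four pruning masks sampled for one image/question pair.
rewards = [0.62, 0.70, 0.55, 0.70]
advantages = group_relative_advantages(rewards)
```

Masks that score above the group average get a positive advantage and are reinforced; masks below it get a negative one. When every mask in a group earns the same reward, all advantages are zero and that group contributes no learning signal.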
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Token Pruning
Discrete Optimization
Computational Overhead
Model Compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models
Token Pruning
Reinforcement Learning
Discrete Optimization
Model Compression
Mingzhe Huang
Institute for AI Industry Research (AIR), Tsinghua University
Weijun Wang
Tsinghua University
LLM Serving System, Edge AI, Video Analytics System
Xin Ding
Institute for AI Industry Research (AIR), Tsinghua University
Liang Mi
Institute for AI Industry Research (AIR), Tsinghua University
Hao Wen
Institute for AI Industry Research (AIR), Tsinghua University
Mobile Computing, AIoT, Artificial Intelligence, Language Agent
Yuanchun Li
Institute for AI Industry Research (AIR), Tsinghua University
mobile computing, artificial intelligence
Lichen Pang
Juhaokan Technology Co., Ltd
Shansong Yang
Juhaokan Technology Co., Ltd
Yunxin Liu
IEEE Fellow, Guoqiang Professor, Institute for AI Industry Research (AIR), Tsinghua University
Mobile Computing, Edge Computing, AIoT, System, Networking
Ting Cao
Institute for AI Industry Research (AIR), Tsinghua University