AI Summary
This work addresses the high computational cost that vision-language models incur from processing large numbers of visual tokens. Existing training-aware pruning methods rely on continuous-gradient relaxations, which often converge to suboptimal solutions under aggressive compression. To overcome this limitation, the authors formulate token pruning as a Markov decision process and propose GRIP-VLM, a reinforcement learning-based Group-Relative Importance Pruning framework. The approach uses Group Relative Policy Optimization (GRPO), anchored by a supervised warm-up, to train a lightweight agent that directly explores the discrete selection space instead of relying on a continuous relaxation. A budget-aware scorer lets the agent adapt to arbitrary compression ratios without retraining. GRIP-VLM consistently outperforms heuristic and supervised-learning baselines across multimodal benchmarks, achieving a superior Pareto frontier and up to a 15% inference speedup at equal accuracy.
Abstract
In Vision-Language Models (VLMs), processing a massive number of visual tokens incurs prohibitive computational overhead. While recent training-aware pruning methods attempt to selectively discard redundant tokens, they largely rely on continuous-gradient relaxations. However, visual token pruning is inherently a discrete, non-convex combinatorial problem; consequently, these continuous approximations frequently trap the optimization in sub-optimal local minima, especially under aggressive compression budgets. To overcome this fundamental bottleneck, we propose GRIP-VLM, a Group-Relative Importance Pruning framework driven by Reinforcement Learning. Rather than relying on smooth-gradient assumptions, GRIP-VLM formulates pruning as a Markov Decision Process, employing a Group Relative Policy Optimization (GRPO) paradigm anchored by supervised warm-up to directly explore the discrete selection space. Integrated with a budget-aware scorer, our lightweight agent dynamically evaluates per-token importance and adapts to arbitrary compression ratios without retraining. Extensive experiments across diverse multimodal benchmarks demonstrate that GRIP-VLM consistently outperforms heuristic and supervised-learning baselines, achieving a superior Pareto frontier and delivering up to a 15% inference speedup at equal accuracy.
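The abstract does not give implementation details, so the following is only a minimal sketch of two ideas it names: selecting tokens under a keep budget, and scoring a group of sampled pruning decisions relative to one another, as GRPO-style methods generally do. All function names, the Gumbel-noise sampling step, and the placeholder rewards are assumptions made for illustration and are not taken from GRIP-VLM.

```python
import math
import random


def keep_top_k(scores, keep_ratio):
    """Return the sorted indices of the tokens kept under the given budget."""
    k = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sample_keep_set(scores, keep_ratio, rng):
    """Sample a discrete keep-set by adding Gumbel noise before top-k.

    The noise scheme is an assumption; the abstract does not say how
    pruning actions are sampled.
    """
    noisy = []
    for s in scores:
        u = max(rng.random(), 1e-12)  # guard against log(0)
        noisy.append(s - math.log(-math.log(u)))
    return keep_top_k(noisy, keep_ratio)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


rng = random.Random(0)
scores = [0.1, 0.9, 0.5, 0.3]   # hypothetical per-token importance scores
group = [sample_keep_set(scores, 0.5, rng) for _ in range(4)]
rewards = [1.0, 0.0, 0.5, 0.5]  # placeholder rewards, not from a real model
advantages = group_relative_advantages(rewards)
```

Because the budget enters only through `keep_ratio`, the same scores can be reused at a different compression ratio without changing the scorer, which is the property the abstract attributes to the budget-aware design.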