🤖 AI Summary
Existing visual token pruning methods for vision-language models rely on handcrafted configurations and struggle to balance computational efficiency against task performance. This work formulates pruning, for the first time, as a budget-aware Pareto frontier optimization problem. Continuous relaxation and straight-through estimators make the search differentiable, and an augmented Lagrangian method learns pruning configurations automatically under the compute budget. The approach also introduces learnable kernel functions and a multi-step progressive pruning mechanism that better matches the hierarchical structure of vision-language models. Experiments on eight visual benchmarks show that the method closely approximates the empirical Pareto frontier obtained through grid search, achieves better accuracy-efficiency trade-offs than single-layer pruning, and generalizes across pruning budgets, pruning methods, and model architectures.
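To make the notion of an empirical Pareto frontier concrete, here is a minimal sketch of how one might extract it from grid-search results. The `(cost, accuracy)` pairs and the function name `pareto_frontier` are hypothetical and chosen only for illustration; they are not results or code from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (relative compute cost, accuracy)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep the points that no other point beats on both cost and accuracy."""
    frontier: List[Point] = []
    best_acc = float("-inf")
    # Scan by increasing cost; on equal cost, visit the more accurate point first.
    for cost, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        # A point survives only if it is strictly more accurate than
        # every cheaper (or equally cheap) configuration seen so far.
        if acc > best_acc:
            frontier.append((cost, acc))
            best_acc = acc
    return frontier


# Hypothetical grid-search results, one entry per pruning configuration.
grid = [(0.2, 0.61), (0.2, 0.58), (0.4, 0.70), (0.5, 0.69), (0.6, 0.74), (0.8, 0.74)]
frontier = pareto_frontier(grid)
# -> [(0.2, 0.61), (0.4, 0.7), (0.6, 0.74)]
```

A configuration such as `(0.5, 0.69)` is dropped because a cheaper configuration already reaches higher accuracy. A learned configuration is close to optimal when it lies on or near this curve.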
📝 Abstract
Visual token pruning methods effectively mitigate the quadratic computational growth caused by processing high-resolution images and video frames in vision-language models (VLMs). However, existing approaches rely on predefined pruning configurations without determining whether they achieve computation-performance optimality. In this work, we introduce a novel framework that formulates visual token pruning as a Pareto configuration optimization problem to automatically identify optimal configurations. Our approach employs continuous relaxation and straight-through estimators to enable gradient-based search, and solves the resulting constrained problem with the augmented Lagrangian method. Extensive experiments across eight visual benchmarks demonstrate that the framework effectively approximates the empirical Pareto frontier obtained through grid search and generalizes well across various pruning methods and VLM architectures. Furthermore, through learnable kernel functions, we investigate layer-wise pruning patterns and reveal that multi-step progressive pruning captures VLMs' hierarchical compression structure, achieving superior accuracy-efficiency trade-offs compared to single-layer approaches.
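The combination of continuous relaxation, a straight-through estimator, and an augmented Lagrangian can be illustrated on a toy problem. The sketch below is not the paper's implementation: the quadratic surrogate loss, the per-stage weights and sensitivities, the token count `N`, and all function names are assumptions made for illustration. Keep ratios are relaxed through a sigmoid, rounded to integer token counts in the forward pass, and the rounding is treated as the identity in the backward pass.

```python
import math

N = 100                     # tokens entering each stage (hypothetical)
WEIGHTS = [0.5, 0.3, 0.2]   # hypothetical share of total compute per pruning stage
SENS = [1.0, 0.5, 0.25]     # hypothetical loss sensitivity per stage
BUDGET = 0.5                # allowed fraction of the unpruned compute


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def forward(theta):
    """Return relaxed keep ratios r and the discrete kept-token counts k."""
    r = [sigmoid(t) for t in theta]
    k = [round(ri * N) for ri in r]   # non-differentiable discretization
    return r, k


def constraint(k) -> float:
    """Budget violation g = cost - BUDGET; g <= 0 means the budget is met."""
    return sum(w * ki / N for w, ki in zip(WEIGHTS, k)) - BUDGET


def search(outer=20, inner=200, lr=1.0, rho=10.0):
    theta = [0.0] * len(WEIGHTS)      # logits of the keep ratios
    lam = 0.0                         # Lagrange multiplier estimate
    for _ in range(outer):
        for _ in range(inner):
            r, k = forward(theta)
            # Effective multiplier of the augmented Lagrangian (inequality form).
            mu = max(0.0, lam + rho * constraint(k))
            for i in range(len(theta)):
                q = k[i] / N
                # Toy surrogate loss: SENS[i] * (1 - q)^2, evaluated at discrete q.
                dloss_dq = -2.0 * SENS[i] * (1.0 - q)
                # Straight-through estimator: pretend d(round)/dr = 1,
                # so dq/dtheta is just the sigmoid derivative.
                dq_dtheta = r[i] * (1.0 - r[i])
                theta[i] -= lr * (dloss_dq + mu * WEIGHTS[i]) * dq_dtheta
        # Multiplier update after each inner minimization.
        _, k = forward(theta)
        lam = max(0.0, lam + rho * constraint(k))
    _, k = forward(theta)
    return k, constraint(k) + BUDGET


keep, cost = search()
```

In this toy setup the search should settle near the budget while keeping more tokens in the stage with the highest sensitivity. A real system would replace the surrogate loss with the VLM's task loss and would typically rely on an automatic differentiation framework rather than hand-written gradients.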