🤖 AI Summary
This work addresses the high computational cost of multimodal large language models caused by redundant visual tokens, a challenge exacerbated by existing pruning methods that overlook the structural distribution of visual representations. The authors propose the first training-free pruning framework that formulates visual token pruning as an optimal transport problem, aligning the distributions of full and pruned token sets to preserve both local diversity and global representativeness. By leveraging the 2-Wasserstein distance, they construct a submodular objective function amenable to optimization and theoretically establish its monotonicity and submodularity, guaranteeing efficient and stable distribution alignment. Extensive experiments demonstrate that the proposed method significantly outperforms current state-of-the-art approaches across multiple benchmarks, achieving superior performance-efficiency trade-offs while maintaining high semantic fidelity.
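For reference, the 2-Wasserstein distance mentioned above is, for two distributions $\mu$ and $\nu$ (here, presumably the empirical distributions of the full and pruned token sets; the paper's exact formulation may differ):

```latex
W_2(\mu, \nu) \;=\; \left( \inf_{\gamma \,\in\, \Pi(\mu,\nu)} \int \|x - y\|^{2} \, d\gamma(x, y) \right)^{1/2}
```

where $\Pi(\mu,\nu)$ is the set of couplings (transport plans) with marginals $\mu$ and $\nu$.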
📝 Abstract
Multi-modal large language models (MLLMs) achieve strong vision-language reasoning but suffer from high inference cost due to redundant visual tokens. Recent work accelerates inference through visual token pruning, but existing methods overlook the underlying distributional structure of visual representations. We propose OTPrune, a training-free framework that formulates pruning as distribution alignment via optimal transport (OT). By minimizing the 2-Wasserstein distance between the full and pruned token distributions, OTPrune preserves both local diversity and global representativeness while reducing inference cost. We further derive a tractable submodular objective that admits efficient optimization, and we prove its monotonicity and submodularity, providing a principled foundation for stable and efficient pruning. A complementary analysis explains how distributional alignment yields stable and semantically faithful pruning. Extensive experiments across a broad range of benchmarks demonstrate that OTPrune achieves superior performance-efficiency trade-offs compared with state-of-the-art methods. The code is available at https://github.com/xiwenc1/OTPrune.
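The abstract frames pruning as selecting a token subset whose distribution stays close to the full set, with monotonicity and submodularity enabling efficient greedy optimization. As a rough illustration only (not the authors' Wasserstein-derived objective), the sketch below greedily maximizes a facility-location surrogate, which is likewise monotone and submodular, so the classic (1 - 1/e) greedy guarantee applies; all function names here are hypothetical:

```python
import numpy as np

def facility_location_value(sim: np.ndarray, selected: list[int]) -> float:
    """F(S) = sum_i max(0, max_{j in S} sim[i, j]): how well the kept
    tokens 'cover' all original tokens. Monotone and submodular."""
    if not selected:
        return 0.0
    return float(np.maximum(sim[:, selected].max(axis=1), 0.0).sum())

def greedy_prune(tokens: np.ndarray, k: int) -> list[int]:
    """Greedily pick k token indices maximizing the facility-location
    objective. NOTE: an illustrative surrogate for distribution-aware
    pruning, not OTPrune's actual 2-Wasserstein-based objective."""
    n = tokens.shape[0]
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = unit @ unit.T                # cosine similarity, all token pairs
    best_cover = np.zeros(n)           # current per-token coverage (>= 0)
    selected: list[int] = []
    for _ in range(min(k, n)):
        # marginal gain of adding candidate j = improvement in coverage
        gains = np.maximum(sim - best_cover[:, None], 0.0).sum(axis=0)
        gains[selected] = -1.0         # never re-pick a kept token
        j = int(gains.argmax())
        selected.append(j)
        best_cover = np.maximum(best_cover, sim[:, j])
    return selected
```

Because the surrogate is monotone, the objective value can only grow as more tokens are kept, mirroring the stability argument the paper makes for its own objective.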