CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models

📅 2025-05-25

🤖 AI Summary

Vision-language models (VLMs) incur high inference costs in time and memory, and existing approaches treat token sparsity and neuron sparsity as independent optimizations, overlooking how the two interact. Method: This work analyzes the matching mechanism between Core Neurons and Core Tokens, shows that the neurons and tokens most important for inference mutually influence and reinforce each other, and builds on that finding to propose CoreMatching, a co-adaptive sparse inference framework that applies token and neuron sparsity jointly. Contribution/Results: The paper challenges the prevailing assumption that the two sparsity paradigms operate independently. Across ten image understanding tasks and three hardware devices, CoreMatching surpasses state-of-the-art baselines; on the NVIDIA Titan Xp it achieves a 5× reduction in FLOPs and a 10× overall speedup.

📝 Abstract

Vision-Language Models (VLMs) excel across diverse tasks but suffer from high inference costs in time and memory. Token sparsity mitigates inefficiencies in token usage, while neuron sparsity reduces high-dimensional computations, both offering promising solutions to enhance efficiency. Recently, these two sparsity paradigms have evolved largely in parallel, fostering the prevailing assumption that they function independently. However, a fundamental yet underexplored question remains: Do they truly operate in isolation, or is there a deeper underlying interplay that has yet to be uncovered? In this paper, we conduct the first comprehensive investigation into this question. By introducing and analyzing the matching mechanism between Core Neurons and Core Tokens, we find that key neurons and tokens for inference mutually influence and reinforce each other. Building on this insight, we propose CoreMatching, a co-adaptive sparse inference framework, which leverages the synergy between token and neuron sparsity to enhance inference efficiency. Through theoretical analysis and efficiency evaluations, we demonstrate that the proposed method surpasses state-of-the-art baselines on ten image understanding tasks and three hardware devices. Notably, on the NVIDIA Titan Xp, it achieves a 5× FLOPs reduction and a 10× overall speedup. Code is released at https://github.com/wangqinsi1/2025-ICML-CoreMatching/tree/main.

Problem

Research questions and friction points this paper is trying to address.

Investigates interplay between token and neuron sparsity in VLMs

Proposes co-adaptive framework to enhance inference efficiency

Achieves a 5× FLOPs reduction and a 10× overall speedup on the NVIDIA Titan Xp

Innovation

Methods, ideas, or system contributions that make the work stand out.

Co-adaptive sparse inference framework

Synergy between token and neuron sparsity

Core Neurons and Core Tokens matching
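
To make the idea of matching Core Neurons with Core Tokens more concrete, the toy sketch below may help. It is not the paper's algorithm: the scoring rules (summed activation for neurons, overlap count for tokens) and the function names `top_k_indices`, `core_neurons`, and `core_tokens` are assumptions chosen only to illustrate how one set of important neurons could be used to rank tokens.

```python
# Illustrative sketch only; the selection rules here are assumptions,
# not the procedure used by CoreMatching.

def top_k_indices(values, k):
    """Return the indices of the k largest entries as a set."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:k])

def core_neurons(acts, k):
    """Pick k neurons with the largest activation summed over all tokens.

    acts is a list of rows, one row per token, one column per neuron.
    """
    n_neurons = len(acts[0])
    totals = [sum(row[j] for row in acts) for j in range(n_neurons)]
    return top_k_indices(totals, k)

def core_tokens(acts, neurons, per_token_k, keep):
    """Keep the tokens whose own top neurons overlap most with `neurons`."""
    scores = []
    for row in acts:
        own_top = top_k_indices(row, per_token_k)
        scores.append(len(own_top & neurons))
    ranked = sorted(range(len(acts)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])  # restore original token order

# Hypothetical post-activation values for 3 tokens and 4 neurons.
example_acts = [
    [5.0, 4.0, 0.0, 0.1],
    [4.0, 6.0, 0.2, 0.0],
    [0.0, 0.1, 3.0, 2.0],
]
example_neurons = core_neurons(example_acts, k=2)
example_tokens = core_tokens(example_acts, example_neurons, per_token_k=2, keep=2)
print(sorted(example_neurons), example_tokens)
```

In this toy setup the same neuron set both restricts the computation (neuron sparsity) and ranks the tokens (token sparsity), which is one simple way to picture the coupling the paper describes.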

🔎 Similar Papers

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference