Fine-grained Token Allocation Via Operation Pruning for Efficient MLLMs

📅 2025-06-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address structural redundancy across modules in multimodal large language models (MLLMs)—causing redundant token processing and coarse-grained computational resource allocation—this paper proposes the first operation-level fine-grained pruning framework. Our method introduces depth-first pruning and additive approximation optimization to compress the pruning search space and linearize validation cost; integrates data-driven search guided by output distribution divergence minimization; and employs module-type- and layer-aware operation grouping to enable dynamic, constraint-aware token allocation. Evaluated on six MLLMs and thirteen benchmarks, our approach significantly outperforms twelve baselines: for LLaVA-Next-7B, it achieves an 86% reduction in computation, an 83% latency decrease, and only a 1% performance drop—marking the first demonstration of efficient, module-level, operation-granular resource orchestration in MLLM inference.

📝 Abstract
Token reduction accelerates Multimodal Large Language Models (MLLMs) by removing excessive tokens, but it overlooks differences in structural redundancy, where critical and redundant modules process identical token loads. For fine-grained computation control, we define an "operation" as the computation performed by a module to process a group of tokens, and introduce the operation pruning framework to let modules selectively process tokens. Built on this framework, we propose Depth-wise Operation Pruning (DOP), a data-driven method that searches for strategies to prune redundant operations and reallocate the saved computational budget so that critical modules can process more tokens than under uniform allocation, by minimizing divergence from the original model's output probability distribution on a small validation set while satisfying computational constraints. For efficient optimization, DOP applies depth-wise pruning to reduce the policy space and uses an additive approximation to minimize the required validation runs. Depth-wise pruning partitions operations by module type and token group, and prunes operations in deeper layers before those in shallower layers within each module-group pair. The additive approximation obtains individual divergences by independently varying each policy parameter, then sums them to approximate the joint divergence of simultaneously changing all policy parameters, reducing the required validation runs from exponential to linear in the number of policy parameters. Comprehensive evaluations show that DOP establishes new state-of-the-art performance across 6 MLLMs and 13 benchmarks against 12 baselines. On LLaVA-Next-7B, DOP achieves an 86% TFLOPS reduction and an 83% latency reduction on a real GPU with only 1% performance loss. Our extensive ablation studies further demonstrate DOP's data and time efficiency as well as its strong generalization capabilities.
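The additive approximation described above can be sketched as follows. This is a hypothetical illustration, not the paper's released code: the function name `additive_joint_divergence` and the policy representation (a dict of parameter values) are assumptions, and `evaluate_divergence` stands in for a validation run that measures output-distribution divergence of the pruned model against the original.

```python
def additive_joint_divergence(baseline, candidate, evaluate_divergence):
    """Approximate the joint divergence of changing all policy parameters
    at once by summing the divergences of changing each parameter alone.

    baseline, candidate: dicts mapping parameter name -> value.
    evaluate_divergence: callable(policy_dict) -> float, the
        output-distribution divergence measured on the validation set
        (0.0 for the baseline policy itself).

    Requires at most len(baseline) validation runs (one per changed
    parameter) instead of one run per combination of parameter values,
    which is exponential in the number of parameters.
    """
    total = 0.0
    for name, new_value in candidate.items():
        if new_value == baseline[name]:
            continue  # unchanged parameters contribute no divergence
        single_change = dict(baseline)
        single_change[name] = new_value  # vary one parameter independently
        total += evaluate_divergence(single_change)
    return total
```

The approximation is exact when parameters affect the output independently; in general it trades some accuracy in the estimated joint divergence for a search cost that is linear rather than exponential in the number of policy parameters.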
Problem

Research questions and friction points this paper is trying to address.

Reduces computational redundancy in multimodal large language models
Enables selective token processing through operation pruning framework
Optimizes token allocation while maintaining model performance under constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Operation pruning framework for selective token processing
Depth-wise pruning strategy to reduce computational redundancy
Additive approximation method for efficient optimization
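The depth-wise pruning strategy above can be sketched in a few lines. The names `prune_schedule` and `keep_depth` are hypothetical; the sketch only illustrates the ordering constraint from the abstract, namely that within each (module type, token group) pair, operations in deeper layers are pruned before shallower ones, so each pair's pruning policy collapses to a single depth threshold instead of an independent choice per layer.

```python
def prune_schedule(num_layers, keep_depth):
    """List pruned operations as (module_type, token_group, layer) tuples.

    keep_depth: dict (module_type, token_group) -> number of shallow
        layers whose operations are kept for that pair; all deeper
        layers have their operations pruned.
    """
    pruned = []
    for (module, group), depth in keep_depth.items():
        # Prune deepest layers first within this module-group pair.
        for layer in range(num_layers - 1, depth - 1, -1):
            pruned.append((module, group, layer))
    return pruned
```

Because each module-group pair is controlled by one threshold, the policy space grows linearly with the number of pairs rather than exponentially with the number of per-layer decisions, which is what makes the divergence-guided search tractable.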