Dynamic Pyramid Network for Efficient Multimodal Large Language Model

📅 2025-03-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) face deployment challenges due to the high computational cost of visual processing; existing visual feature compression methods often degrade fine-grained semantics, leading to significant performance drops, especially on challenging samples. To address this, the authors propose the Dynamic Pyramid Network (DPN), a hierarchical architecture in which visual features are compressed progressively with depth. DPN integrates input-driven Dynamic Pooling Experts (DPEs) that adaptively allocate computation by choosing the compression rate per input. By preserving discriminative visual semantics even under high compression ratios, DPN supports robust multimodal inference. Evaluated on LLaVA, DPN reduces average FLOPs by up to 56% while improving overall accuracy by 0.74%. When transferred to LLaVA-HR, DPN maintains consistent performance gains, demonstrating strong generalizability across MLLM backbones.
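The core idea of the pyramid, progressively pooling visual tokens as depth increases so shallow layers still see fine-grained tokens, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation; the stage ratios, token count (576, as in LLaVA's 24x24 patch grid), and average pooling are assumptions for demonstration.

```python
import numpy as np

def avg_pool_tokens(tokens, ratio):
    """Average-pool a (num_tokens, dim) array by `ratio` along the token axis."""
    n, d = tokens.shape
    n_keep = n // ratio
    return tokens[: n_keep * ratio].reshape(n_keep, ratio, d).mean(axis=1)

def pyramid_compress(tokens, stage_ratios=(1, 2, 2)):
    """Progressively compress visual tokens stage by stage.

    Returns the token set visible to each stage: shallow stages keep
    fine-grained tokens, deeper stages see fewer, coarser tokens.
    (Illustrative only -- the actual DPN stage layout is defined in the paper.)
    """
    stages = []
    for r in stage_ratios:
        if r > 1:
            tokens = avg_pool_tokens(tokens, r)
        stages.append(tokens)
    return stages

# 576 visual tokens (e.g. a 24x24 patch grid), feature dim 8 for illustration
vis = np.random.randn(576, 8)
stages = pyramid_compress(vis, stage_ratios=(1, 2, 2))
print([s.shape[0] for s in stages])  # [576, 288, 144]
```

Deeper stages attend over 2x and then 4x fewer visual tokens, which is where the FLOPs savings come from, while the first stage still processes the full-resolution token set.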

📝 Abstract
Multimodal large language models (MLLMs) have demonstrated impressive performance on various vision-language (VL) tasks, but their expensive computation still limits real-world applications. To address this issue, recent efforts aim to compress the visual features to reduce the computational costs of MLLMs. However, direct visual compression methods, e.g., efficient projectors, inevitably destroy visual semantics in MLLMs, especially on difficult samples. To overcome this shortcoming, we propose a novel dynamic pyramid network (DPN) for efficient MLLMs. Specifically, DPN formulates the MLLM as a hierarchical structure where visual features are gradually compressed with increasing depth. In this case, even with a high compression ratio, fine-grained visual information can still be perceived in shallow layers. To maximize the benefit of DPN, we further propose an innovative Dynamic Pooling Experts (DPE) module that can dynamically choose the optimal visual compression rate according to input features. With this design, harder samples are assigned more computation, thus preserving model performance. To validate our approach, we conduct extensive experiments on two popular MLLMs and ten benchmarks. Experimental results show that DPN can save up to 56% average FLOPs on LLaVA while further achieving +0.74% performance gains. Besides, the generalization ability of DPN is also validated on the existing high-resolution MLLM called LLaVA-HR. Our source codes are anonymously released at https://github.com/aihao2000/DPN-LLaVA.
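The Dynamic Pooling Experts idea, routing each input to a compression rate based on its features so that harder samples get more tokens, can be sketched as a lightweight gate over a global feature. This is a hypothetical sketch: the gate design, candidate ratios, and argmax routing below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def dpe_select_ratio(tokens, gate_w, candidate_ratios=(1, 2, 4)):
    """Hypothetical Dynamic Pooling Expert gate.

    Scores the input's global feature with a linear gate and picks one
    pooling ratio among the candidates; a smaller ratio means less
    compression (more computation) for that sample.
    gate_w: (dim, num_candidates) gate weights, assumed learned in practice.
    """
    global_feat = tokens.mean(axis=0)   # (dim,) pooled summary of the input
    logits = global_feat @ gate_w       # one logit per candidate ratio
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                # expert selection probabilities
    return candidate_ratios[int(np.argmax(probs))], probs

# Illustrative usage with random features and random gate weights
rng = np.random.default_rng(0)
tokens = rng.standard_normal((576, 8))
gate_w = rng.standard_normal((8, 3))
ratio, probs = dpe_select_ratio(tokens, gate_w)
```

In a trained model the gate would be optimized end to end, so inputs whose visual content is harder to summarize receive a lower compression rate; at inference the hard argmax makes the per-sample FLOPs input-dependent.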
Problem

Research questions and friction points this paper is trying to address.

High computational cost of visual processing limits real-world deployment of MLLMs
Direct visual compression destroys fine-grained semantics, especially on difficult samples
Fixed compression rates cannot adapt to input complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical visual feature compression in MLLMs
Dynamic Pooling Experts for adaptive compression
Saves up to 56% average FLOPs on LLaVA while improving accuracy by +0.74%