Efficient Multi-modal Long Context Learning for Training-free Adaptation

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional multimodal large language models (MLLMs) rely on fine-tuning to adapt to new tasks, which is inefficient and inflexible. This paper proposes EMLoC, a training-free multimodal task adaptation method that enables efficient, flexible, and scalable cross-task transfer by embedding demonstration examples directly into the input and compressing the resulting long context. Its core innovation is the first joint design of chunk-wise context compression and layer-wise adaptive token pruning, optimized under a Jensen-Shannon divergence constraint to yield compact, task-aware multimodal representations, significantly alleviating the computational and memory bottlenecks of long-context inference. Evaluated across multiple vision-language benchmarks, EMLoC matches or surpasses naive long-context baselines while drastically reducing inference overhead. The implementation is publicly available.
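
The summary describes layer-wise token pruning gated by a Jensen-Shannon divergence constraint. The following is a minimal sketch of that idea, not the paper's implementation: the function names, the attention-based importance proxy, and the threshold `tau` are illustrative assumptions.

```python
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Jensen-Shannon divergence between two (unnormalized) distributions."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        return float(np.sum(a * (np.log(a + eps) - np.log(b + eps))))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def prune_layer_tokens(scores: np.ndarray, attn: np.ndarray,
                       budget: float, tau: float) -> np.ndarray:
    """Drop the lowest-scoring context tokens at one layer while keeping
    the attention distribution over the survivors within a JS threshold.

    scores -- per-token importance (e.g., attention each token receives)
    attn   -- the layer's attention distribution over the context tokens
    budget -- fraction of tokens we would ideally keep
    tau    -- JS-divergence threshold; if exceeded, keep more tokens
    """
    order = np.argsort(scores)[::-1]              # most important first
    keep = max(1, int(budget * len(scores)))
    while keep < len(scores):
        kept = np.zeros_like(attn)
        idx = order[:keep]
        kept[idx] = attn[idx]                     # zero out pruned tokens
        if js_divergence(attn, kept) <= tau:      # distribution barely shifts
            break
        keep += max(1, len(scores) // 20)         # relax the budget slightly
    return np.sort(order[:keep])                  # kept indices, original order
```

Here `tau` trades compression rate against fidelity: a smaller threshold forces the attention distribution after pruning to stay closer to the original, so more tokens survive at that layer.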

📝 Abstract
Traditional approaches to adapting multi-modal large language models (MLLMs) to new tasks have relied heavily on fine-tuning. This paper introduces Efficient Multi-Modal Long Context Learning (EMLoC), a novel training-free alternative that embeds demonstration examples directly into the model input. EMLoC offers a more efficient, flexible, and scalable solution for task adaptation. Because extremely lengthy inputs introduce prohibitive computational and memory overhead, EMLoC contributes a chunk-wise compression mechanism combined with layer-wise adaptive pruning. It condenses long-context multimodal inputs into compact, task-specific memory representations. By adaptively pruning tokens at each layer under a Jensen-Shannon divergence constraint, our method achieves a dramatic reduction in inference complexity without sacrificing performance. This approach is the first to seamlessly integrate compression and pruning techniques for multi-modal long-context learning, offering a scalable and efficient solution for real-world applications. Extensive experiments on diverse vision-language benchmarks demonstrate that EMLoC achieves performance on par with or superior to naive long-context approaches. Our results highlight the potential of EMLoC as a groundbreaking framework for efficient and flexible adaptation of multi-modal models in resource-constrained environments. Codes are publicly available at https://github.com/Zehong-Ma/EMLoC.
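
To make the chunk-wise compression idea concrete, here is a hedged sketch under simple assumptions: the long demonstration context is split into fixed-size chunks, and each chunk's key-value entries are reduced to the highest-scoring ones. The `ChunkMemory` container, the key-norm importance proxy, and `keep_ratio` are illustrative stand-ins, not the paper's actual scoring or compression.

```python
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class ChunkMemory:
    """Compact, task-specific memory distilled from one context chunk."""
    keys: np.ndarray    # shape (n_kept, d)
    values: np.ndarray  # shape (n_kept, d)

def compress_context(token_kv: List[Tuple[np.ndarray, np.ndarray]],
                     chunk_size: int, keep_ratio: float) -> List[ChunkMemory]:
    """Split a long demonstration context into fixed-size chunks and keep
    only the highest-scoring key-value pairs in each chunk. Key-norm
    scoring is a stand-in for the model-internal importance signal."""
    memories = []
    for start in range(0, len(token_kv), chunk_size):
        chunk = token_kv[start:start + chunk_size]
        keys = np.stack([k for k, _ in chunk])
        values = np.stack([v for _, v in chunk])
        scores = np.linalg.norm(keys, axis=-1)          # proxy importance
        keep = max(1, int(keep_ratio * len(chunk)))
        idx = np.sort(np.argsort(scores)[::-1][:keep])  # keep original order
        memories.append(ChunkMemory(keys[idx], values[idx]))
    return memories

# Example: 1,000 context tokens with 64-dim keys/values, compressed 4x per chunk.
rng = np.random.default_rng(0)
kv = [(rng.normal(size=64), rng.normal(size=64)) for _ in range(1000)]
memories = compress_context(kv, chunk_size=256, keep_ratio=0.25)
```
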
Problem

Research questions and friction points this paper is trying to address.

Training-free adaptation for multi-modal large language models
Efficient compression of long-context multimodal inputs
Scalable solution for resource-constrained environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free adaptation via input embedding
Chunk-wise compression with adaptive pruning
Compact memory representations for long contexts
Zehong Ma
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Shiliang Zhang
Department of Computer Science, School of EECS, Peking University
Multimedia Information Retrieval · Multimedia Systems · Visual Search
Longhui Wei
Senior Researcher, Huawei
Multimodal & Visual Pre-training · VLM · Multimodal Generation
Qi Tian
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)