Efficient Multi-modal Long Context Learning for Training-free Adaptation

📅 2025-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Multimodal large language models (MLLMs) are usually adapted to new tasks by fine-tuning, which is costly and inflexible. This paper proposes EMLoC, a training-free task adaptation method that embeds demonstration examples directly in the model input and compresses the resulting long context. Its core contribution is the joint design of chunk-wise context compression and layer-wise adaptive token pruning, with the pruning at each layer bounded by a Jensen–Shannon divergence constraint. Together these condense long multimodal inputs into compact, task-specific memory representations and reduce the computational and memory cost of long-context inference. The authors describe it as the first approach to integrate compression and pruning for multi-modal long-context learning. On diverse vision-language benchmarks, EMLoC performs on par with or better than naive long-context approaches while substantially reducing inference overhead. The code is publicly available.

📝 Abstract

Traditional approaches to adapting multi-modal large language models (MLLMs) to new tasks have relied heavily on fine-tuning. This paper introduces Efficient Multi-Modal Long Context Learning (EMLoC), a novel training-free alternative that embeds demonstration examples directly into the model input. EMLoC offers a more efficient, flexible, and scalable solution for task adaptation. Because extremely lengthy inputs introduce prohibitive computational and memory overhead, EMLoC contributes a chunk-wise compression mechanism combined with layer-wise adaptive pruning. It condenses long-context multimodal inputs into compact, task-specific memory representations. By adaptively pruning tokens at each layer under a Jensen-Shannon divergence constraint, our method achieves a dramatic reduction in inference complexity without sacrificing performance. This approach is the first to seamlessly integrate compression and pruning techniques for multi-modal long-context learning, offering a scalable and efficient solution for real-world applications. Extensive experiments on diverse vision-language benchmarks demonstrate that EMLoC achieves performance on par with or superior to naive long-context approaches. Our results highlight the potential of EMLoC as a groundbreaking framework for efficient and flexible adaptation of multi-modal models in resource-constrained environments. Code is publicly available at https://github.com/Zehong-Ma/EMLoC.
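The idea of pruning under a Jensen-Shannon divergence constraint can be illustrated with a small sketch. This is not the authors' implementation: the function names, the candidate keep ratios, the threshold `delta`, and the toy stand-in for the model are all assumptions made for illustration. The sketch only shows the general pattern of accepting the most aggressive pruning level whose output distribution stays within a JS divergence budget of the unpruned output.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    mx = max(logits)
    exps = [math.exp(v - mx) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        # Terms with a == 0 contribute nothing to the KL sum.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def pick_keep_ratio(full_logits, logits_for_ratio, candidate_ratios, delta):
    """Return the smallest keep ratio whose output stays within `delta` of the full output.

    `logits_for_ratio` is a hypothetical callback standing in for "run the model
    with only this fraction of context tokens kept"; it is an assumption of this
    sketch, not an API from the paper or its code release.
    """
    p_full = softmax(full_logits)
    for ratio in sorted(candidate_ratios):
        q = softmax(logits_for_ratio(ratio))
        if js_divergence(p_full, q) <= delta:
            return ratio
    return 1.0  # fall back to keeping everything


# Toy stand-in: pruning more context flattens the output distribution.
FULL_LOGITS = [2.0, 1.0, 0.1]


def toy_logits_for_ratio(ratio):
    return [v * ratio for v in FULL_LOGITS]


if __name__ == "__main__":
    chosen = pick_keep_ratio(FULL_LOGITS, toy_logits_for_ratio, [0.25, 0.5, 1.0], delta=0.005)
    print("chosen keep ratio:", chosen)
```

In the method described above, this kind of check would be applied per layer, so that each layer can retain a different number of tokens. The single-step search here is a simplification.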

Problem

Research questions and friction points this paper is trying to address.

Training-free adaptation for multi-modal large language models

Efficient compression of long-context multimodal inputs

Scalable solution for resource-constrained environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free adaptation by embedding demonstration examples in the model input

Chunk-wise compression combined with layer-wise adaptive pruning

Compact, task-specific memory representations for long-context inputs
