MAIL++: Multi-Modal Bi-directional Agent Layer for Vision-Language Models

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

career value

209K/year

🤖 AI Summary

This work addresses the challenges of high computational cost and overfitting under data scarcity when fine-tuning large-scale vision-language models for downstream tasks, as well as the limited expressiveness and indirect cross-modal interaction in existing parameter-efficient fine-tuning (PEFT) methods. The authors propose MAIL and its enhanced variant MAIL++, which introduce lightweight agent layers after core modules of the backbone network, establishing a novel bidirectional meta-bridging mechanism. This enables fine-grained, structurally aligned cross-modal coupling. During inference, the adapters are efficiently merged back into the backbone via reparameterization to preserve computational efficiency. Extensive experiments demonstrate that the proposed approach significantly outperforms current PEFT methods on few-shot image classification and cross-domain retrieval tasks, achieving superior performance while maintaining low computational overhead.

📝 Abstract

Adapting large vision-language models (VLMs) such as CLIP to downstream tasks remains challenging, as full fine-tuning is computationally prohibitive and prone to overfitting in low-data regimes. Parameter-efficient fine-tuning (PEFT) alleviates these issues with lightweight prompt- or adapter-based modules, and cross-modal coupling has proven especially effective by strengthening interactions between vision and language. However, existing coupling mechanisms predominantly rely on external auxiliary modules, leading to indirect, coarse-grained interactions that are structurally decoupled from the original VLM and thus limit representational expressiveness. In this paper, we propose Multi-Modal Interactive Agent Layer (MAIL), a PEFT paradigm that embeds cross-modal coupling directly into the intrinsic computation modules of VLMs. MAIL freezes the backbone and inserts lightweight agent layers after core modules, such as LayerNorm, to approximate the parameter updates induced by full fine-tuning. To couple visual and textual streams at this level, we introduce a bottleneck-based text-to-image bridge that jointly optimizes paired agent layers across modalities, coordinating the adaptation of corresponding computation modules. We further present MAIL++, which enables bidirectional cross-modal exchange through a meta agent layer, a meta-text bridge, and a meta-image bridge. At inference time, all agent layers are re-parameterized into the frozen backbone, preserving the original computational efficiency. Extensive experiments on few-shot image classification and few-shot universal cross-domain retrieval demonstrate that MAIL and MAIL++ consistently outperform state-of-the-art PEFT methods.

Problem

Research questions and friction points this paper is trying to address.

vision-language models

parameter-efficient fine-tuning

cross-modal coupling

multimodal interaction

adapter-based tuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

parameter-efficient fine-tuning

cross-modal coupling

vision-language models