🤖 AI Summary
Few-shot fine-tuning of vision-language models (VLMs) faces a fundamental trade-off between task-specific adaptation and generalization capability, and existing adapter-based approaches lag behind in performance. To address this, we propose RMAdapter—a dual-branch architecture comprising a discriminative main branch for task-specific fine-tuning and a reconstruction branch that preserves generic knowledge via latent-space feature reconstruction, jointly regularized by a dynamic consistency constraint. RMAdapter integrates parameter-efficient fine-tuning, layer-local reconstruction losses, and a shared projection module to balance model compactness and representational capacity. Extensive experiments demonstrate that RMAdapter achieves state-of-the-art performance across cross-category, cross-dataset, and domain generalization benchmarks—without relying on data augmentation or prompt engineering. Its core innovation is the first explicit incorporation of a reconstruction mechanism into VLM adapter design, enabling the joint optimization of knowledge retention and task adaptation.
📝 Abstract
Pre-trained Vision-Language Models (VLMs), e.g., CLIP, have become essential tools in multimodal transfer learning. However, fine-tuning VLMs in few-shot scenarios poses significant challenges in balancing task-specific adaptation against the generalization of the resulting model. Meanwhile, current research has predominantly focused on prompt-based adaptation methods, leaving adapter-based approaches underexplored and revealing notable performance gaps. To address these challenges, we introduce a novel Reconstruction-based Multimodal Adapter (RMAdapter), which leverages a dual-branch architecture. Unlike conventional single-branch adapters, RMAdapter consists of: (1) an adaptation branch that injects task-specific knowledge through parameter-efficient fine-tuning, and (2) a reconstruction branch that preserves general knowledge by reconstructing latent-space features back into the original feature space. This design facilitates a dynamic balance between general and task-specific knowledge. Importantly, although RMAdapter introduces an additional reconstruction branch, it is carefully optimized to remain lightweight: by computing the reconstruction loss locally at each layer and sharing projection modules, the overall computational overhead is kept minimal. A consistency constraint is also incorporated to better regulate the trade-off between discriminability and generalization. We comprehensively evaluate the effectiveness of RMAdapter on three representative tasks: generalization to new categories, generalization to new target datasets, and domain generalization. Without relying on data augmentation or elaborate prompt designs, our RMAdapter consistently outperforms state-of-the-art approaches across all evaluation metrics.
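To make the dual-branch idea concrete, here is a minimal NumPy sketch of one adapter layer under the structure the abstract describes: a shared down-projection feeding both branches, a residual adaptation branch, a reconstruction branch mapping latent features back to the input space with a layer-local reconstruction loss, and a consistency term between adapted and frozen features. All variable names, the linear bottleneck form, and the choice of MSE are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 8, 2  # feature dim and bottleneck dim (toy sizes for illustration)

# Shared down-projection used by both branches (one way to stay lightweight).
W_down = rng.normal(scale=0.1, size=(d, r))
W_up_adapt = rng.normal(scale=0.1, size=(r, d))  # adaptation branch (task-specific)
W_up_recon = rng.normal(scale=0.1, size=(r, d))  # reconstruction branch (generic)

def rm_adapter_layer(x):
    """One hypothetical RMAdapter layer acting on frozen backbone features x."""
    z = np.maximum(x @ W_down, 0.0)            # shared latent features (ReLU bottleneck)
    adapted = x + z @ W_up_adapt               # residual task-specific update
    recon = z @ W_up_recon                     # reconstruct the original features
    recon_loss = np.mean((recon - x) ** 2)     # local per-layer reconstruction loss
    cons_loss = np.mean((adapted - x) ** 2)    # consistency between adapted and frozen features
    return adapted, recon_loss, cons_loss

x = rng.normal(size=(4, d))                    # a batch of 4 token features
adapted, recon_loss, cons_loss = rm_adapter_layer(x)
```

Because the reconstruction loss is computed locally at each layer, no extra backbone pass is needed; the two auxiliary losses would simply be added (with weighting hyperparameters) to the task loss during few-shot fine-tuning.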