Generalizing Vision-Language Models with Dedicated Prompt Guidance

📅 2025-12-02
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Vision-language models (VLMs) face a fundamental trade-off between domain specificity and generalization during downstream adaptation: full fine-tuning often degrades out-of-distribution generalization. To address this, we propose a two-stage domain-expert-guided framework. First, lightweight domain experts are constructed via domain-partitioned prompt tuning. Second, an adaptive expert-fusion cross-modal attention module dynamically integrates expert knowledge. Theoretically, we prove that domain-partitioned training strictly dominates global fine-tuning under mild assumptions. Methodologically, we introduce the first prompt-driven domain expert ensemble mechanism and establish ImageNet-DG, a novel few-shot domain generalization benchmark. Experiments demonstrate consistent, significant improvements over state-of-the-art methods on standard domain generalization benchmarks and ImageNet-DG, while maintaining parameter efficiency (e.g., adding only 0.1% trainable parameters per expert).
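The adaptive expert fusion described above can be pictured as scaled dot-product attention in which the image feature queries the per-domain expert embeddings. A minimal NumPy sketch, assuming precomputed features; the paper's actual Cross-Modal Attention module, its shapes, and any learned projections may differ:

```python
import numpy as np

def fuse_experts(image_feat, expert_feats):
    """Attention-weighted fusion of per-domain expert embeddings (sketch).

    image_feat:   (d,)   feature from the (frozen) vision encoder
    expert_feats: (k, d) one embedding per source-domain expert

    Returns the fused guidance vector (d,) and the expert weights (k,).
    """
    d = image_feat.shape[0]
    # Scaled dot-product attention: the image feature acts as the query,
    # the expert embeddings act as both keys and values.
    scores = expert_feats @ image_feat / np.sqrt(d)   # (k,)
    scores -= scores.max()                            # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over experts
    return weights @ expert_feats, weights
```

Intuitively, an image close to one source domain receives guidance dominated by that domain's expert, while an out-of-distribution image blends several experts.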

๐Ÿ“ Abstract
Fine-tuning large pretrained vision-language models (VLMs) has emerged as a prevalent paradigm for downstream adaptation, yet it faces a critical trade-off between domain specificity and domain generalization (DG) ability. Current methods typically fine-tune a universal model on the entire dataset, which potentially compromises the ability to generalize to unseen domains. To fill this gap, we provide a theoretical understanding of the generalization ability for VLM fine-tuning, which reveals that training multiple parameter-efficient expert models on partitioned source domains leads to better generalization than fine-tuning a universal model. Inspired by this finding, we propose a two-step domain-expert-Guided DG (GuiDG) framework. GuiDG first employs prompt tuning to obtain source domain experts, then introduces a Cross-Modal Attention module to guide the fine-tuning of the vision encoder via adaptive expert integration. To better evaluate few-shot DG, we construct ImageNet-DG from ImageNet and its variants. Extensive experiments on standard DG benchmarks and ImageNet-DG demonstrate that GuiDG improves upon state-of-the-art fine-tuning methods while maintaining efficiency.
Problem

Research questions and friction points this paper is trying to address.

Addresses trade-off between domain specificity and generalization in vision-language models
Proposes expert-guided framework to enhance generalization to unseen domains
Introduces new dataset for evaluating few-shot domain generalization performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multiple expert models for domain generalization
Cross-Modal Attention guides vision encoder fine-tuning
Prompt tuning adapts vision-language models efficiently
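As a concrete illustration of the first step, domain-partitioned prompt tuning can be reduced to learning one small parameter vector per source domain on top of a frozen backbone. A toy sketch, in which the frozen classifier head, shapes, and function names are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def train_domain_prompts(features, domains, labels, n_classes, lr=0.1, steps=200):
    """Learn one lightweight prompt (here: a bias on frozen features)
    per source domain via softmax-cross-entropy gradient descent."""
    prompts = {}
    for dom in set(domains):
        idx = [i for i, d in enumerate(domains) if d == dom]
        X = features[idx]              # frozen backbone features for this domain
        y = labels[idx]
        d_feat = X.shape[1]
        prompt = np.zeros(d_feat)      # the only trainable parameters
        W = np.eye(n_classes, d_feat)  # frozen classifier head (stand-in)
        for _ in range(steps):
            logits = (X + prompt) @ W.T
            p = np.exp(logits - logits.max(axis=1, keepdims=True))
            p /= p.sum(axis=1, keepdims=True)
            p[np.arange(len(y)), y] -= 1.0        # dL/dlogits = softmax - onehot
            grad = (p @ W).mean(axis=0)           # gradient w.r.t. the prompt
            prompt -= lr * grad
        prompts[dom] = prompt
    return prompts
```

Because only the prompt vector is updated per partition, each expert adds a negligible number of trainable parameters, matching the parameter-efficiency claim above.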
👥 Authors
Xinyao Li
University of Electronic Science and Technology of China
Yinjie Min
School of Statistics and Data Science, Nankai University
Hongbo Chen
School of Computer Science and Engineering, University of Electronic Science and Technology of China
Zhekai Du
University of Electronic Science and Technology of China
Domain Adaptation · Generative Models · Parameter-Efficient Fine-Tuning
Fengling Li
University of Technology Sydney
Cross-modal Analysis · Domain Adaptation · Multimodal Learning
Jingjing Li
School of Computer Science and Engineering, University of Electronic Science and Technology of China