Generalizing Vision-Language Models with Dedicated Prompt Guidance

📅 2025-12-02
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Vision-language models (VLMs) face a fundamental trade-off between domain specificity and generalization during downstream adaptation: full fine-tuning often degrades out-of-distribution generalization. To address this, we propose a two-stage domain-expert-guided framework. First, lightweight domain experts are constructed via domain-partitioned prompt tuning. Second, an adaptive expert-fusion cross-modal attention module dynamically integrates expert knowledge. Theoretically, we prove that domain-partitioned training strictly dominates global fine-tuning under mild assumptions. Methodologically, we introduce the first prompt-driven domain expert ensemble mechanism and establish ImageNet-DG, a novel few-shot domain generalization benchmark. Experiments demonstrate consistent, significant improvements over state-of-the-art methods on standard domain generalization benchmarks and ImageNet-DG, while maintaining parameter efficiency (e.g., adding only 0.1% trainable parameters per expert).
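The adaptive expert fusion described above can be pictured as scaled dot-product attention in which the image feature queries the per-domain expert embeddings. A minimal NumPy sketch, assuming precomputed features; the paper's actual Cross-Modal Attention module, its shapes, and any learned projections may differ:

```python
import numpy as np

def fuse_experts(image_feat, expert_feats):
    """Attention-weighted fusion of per-domain expert embeddings (sketch).

    image_feat:   (d,)   feature from the (frozen) vision encoder
    expert_feats: (k, d) one embedding per source-domain expert

    Returns the fused guidance vector (d,) and the expert weights (k,).
    """
    d = image_feat.shape[0]
    # Scaled dot-product attention: the image feature acts as the query,
    # the expert embeddings act as both keys and values.
    scores = expert_feats @ image_feat / np.sqrt(d)   # (k,)
    scores -= scores.max()                            # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over experts
    return weights @ expert_feats, weights
```

Intuitively, an image close to one source domain receives guidance dominated by that domain's expert, while an out-of-distribution image blends several experts.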

๐Ÿ“ Abstract
Fine-tuning large pretrained vision-language models (VLMs) has emerged as a prevalent paradigm for downstream adaptation, yet it faces a critical trade-off between domain specificity and domain generalization (DG) ability. Current methods typically fine-tune a universal model on the entire dataset, which potentially compromises the ability to generalize to unseen domains. To fill this gap, we provide a theoretical understanding of the generalization ability for VLM fine-tuning, which reveals that training multiple parameter-efficient expert models on partitioned source domains leads to better generalization than fine-tuning a universal model. Inspired by this finding, we propose a two-step domain-expert-Guided DG (GuiDG) framework. GuiDG first employs prompt tuning to obtain source domain experts, then introduces a Cross-Modal Attention module to guide the fine-tuning of the vision encoder via adaptive expert integration. To better evaluate few-shot DG, we construct ImageNet-DG from ImageNet and its variants. Extensive experiments on standard DG benchmarks and ImageNet-DG demonstrate that GuiDG improves upon state-of-the-art fine-tuning methods while maintaining efficiency.
Problem

Research questions and friction points this paper is trying to address.

Addresses trade-off between domain specificity and generalization in vision-language models
Proposes expert-guided framework to enhance generalization to unseen domains
Introduces new dataset for evaluating few-shot domain generalization performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multiple expert models for domain generalization
Cross-Modal Attention guides vision encoder fine-tuning
Prompt tuning adapts vision-language models efficiently
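As a concrete illustration of the first step, domain-partitioned prompt tuning can be reduced to learning one small parameter vector per source domain on top of a frozen backbone. A toy sketch, in which the frozen classifier head, shapes, and function names are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def train_domain_prompts(features, domains, labels, n_classes, lr=0.1, steps=200):
    """Learn one lightweight prompt (here: a bias on frozen features)
    per source domain via softmax-cross-entropy gradient descent."""
    prompts = {}
    for dom in set(domains):
        idx = [i for i, d in enumerate(domains) if d == dom]
        X = features[idx]              # frozen backbone features for this domain
        y = labels[idx]
        d_feat = X.shape[1]
        prompt = np.zeros(d_feat)      # the only trainable parameters
        W = np.eye(n_classes, d_feat)  # frozen classifier head (stand-in)
        for _ in range(steps):
            logits = (X + prompt) @ W.T
            p = np.exp(logits - logits.max(axis=1, keepdims=True))
            p /= p.sum(axis=1, keepdims=True)
            p[np.arange(len(y)), y] -= 1.0        # dL/dlogits = softmax - onehot
            grad = (p @ W).mean(axis=0)           # gradient w.r.t. the prompt
            prompt -= lr * grad
        prompts[dom] = prompt
    return prompts
```

Because only the prompt vector is updated per partition, each expert adds a negligible number of trainable parameters, matching the parameter-efficiency claim above.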
👥 Authors
Xinyao Li
University of Electronic Science and Technology of China
Yinjie Min
School of Statistics and Data Science, Nankai University
Hongbo Chen
School of Computer Science and Engineering, University of Electronic Science and Technology of China
Zhekai Du
University of Electronic Science and Technology of China
Domain Adaptation · Generative Models · Parameter-Efficient Fine-Tuning
Fengling Li
University of Technology Sydney
Cross-modal Analysis · Domain Adaptation · Multimodal Learning
Jingjing Li
School of Computer Science and Engineering, University of Electronic Science and Technology of China