🤖 AI Summary
Addressing the challenge of disentangling domain-invariant visual features in domain generalization, this paper proposes a language-guided disentangled prompt learning framework. Methodologically, it (1) leverages a large language model to automatically disentangle the text prompt, whose features then guide visual prompt tuning; (2) introduces Worst Explicit Representation Alignment (WERA), which adds a set of abstract prompts and enforces consistency of visual representations across original and style-augmented images; and (3) builds on pretrained vision-language foundation models (e.g., CLIP) to achieve cross-modal disentangled alignment between text and vision. Evaluated on five standard benchmarks (PACS, VLCS, OfficeHome, DomainNet, and TerraInc), the method consistently outperforms state-of-the-art approaches, achieving average accuracy gains of 2.1–4.7 percentage points. These results empirically validate the efficacy of language-prior-driven prompt disentanglement for enhancing cross-domain generalization capability.
📝 Abstract
Domain Generalization (DG) seeks to develop a versatile model capable of performing effectively on unseen target domains. Notably, recent advances in pre-trained Visual Foundation Models (VFMs), such as CLIP, have demonstrated considerable potential in enhancing the generalization capabilities of deep learning models. Despite the increasing attention toward VFM-based domain prompt tuning within DG, the effective design of prompts capable of disentangling invariant features across diverse domains remains a critical challenge. In this paper, we propose addressing this challenge by leveraging the controllable and flexible language prompt of the VFM. Noting that the text modality of VFMs is naturally easier to disentangle, we introduce a novel framework for text feature-guided visual prompt tuning. This framework first automatically disentangles the text prompt using a large language model (LLM) and then learns a domain-invariant visual representation guided by the disentangled text features. However, relying solely on language to guide visual feature disentanglement has limitations, as visual features can sometimes be too complex or nuanced to be fully captured by descriptive text. To address this, we introduce Worst Explicit Representation Alignment (WERA), which extends text-guided visual prompts by incorporating an additional set of abstract prompts. These prompts enhance source domain diversity through stylized image augmentations, while alignment constraints ensure that visual representations remain consistent across both the original and augmented distributions. Experiments conducted on major DG datasets, including PACS, VLCS, OfficeHome, DomainNet, and TerraInc, demonstrate that our proposed method outperforms state-of-the-art DG methods.
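The alignment constraint described above can be illustrated with a minimal sketch. The abstract does not give the WERA formulation, so everything here is an assumption made for illustration: the function name `worst_case_alignment_loss`, the use of cosine distance, and the reading of "worst" as the stylized augmentation whose features drift furthest from the original ones. It is not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def worst_case_alignment_loss(orig_feats, aug_feats):
    """Hypothetical worst-case alignment between original and augmented features.

    orig_feats: array of shape (B, D), features of the original images.
    aug_feats:  array of shape (K, B, D), features of K stylized versions
                of the same B images.
    Returns (loss, worst_index): the largest batch-averaged cosine distance
    over the K augmentations, and the index of that augmentation.
    """
    orig = l2_normalize(orig_feats)
    aug = l2_normalize(aug_feats)
    # Cosine similarity between each augmented feature and its original.
    cos = np.sum(aug * orig[None, :, :], axis=-1)  # shape (K, B)
    # Average cosine distance per augmentation.
    per_aug = np.mean(1.0 - cos, axis=1)           # shape (K,)
    worst = int(np.argmax(per_aug))
    return float(per_aug[worst]), worst


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 8))
    # Two synthetic "styles": a mild and a strong perturbation (assumed data).
    styled = np.stack([
        feats + 0.01 * rng.normal(size=feats.shape),
        feats + 1.00 * rng.normal(size=feats.shape),
    ])
    loss, idx = worst_case_alignment_loss(feats, styled)
    print(f"worst augmentation: {idx}, alignment loss: {loss:.4f}")
```

Minimizing only the largest distance concentrates the training signal on the hardest stylization rather than averaging it away; whether WERA selects or generates its worst case in this way is not stated in the abstract.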