SHED: Style-Homogenized Embedding Alignment for Domain Generalization

📅 2026-05-16
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the performance degradation of vision-language models in domain generalization, which stems from cross-modal information asymmetry: image embeddings encode both class semantics and domain-specific styles, whereas text embeddings provide only basic category cues. To mitigate this, the authors propose SHED, a method that aligns style-homogenized embeddings within the CLIP framework. During training, SHED subtracts the per-source-domain style centroid from image embeddings, and averages text embeddings over multiple prompt templates before subtracting a global centroid. At inference, where the target domain is unknown, it projects textual domain centroids into the visual space and fuses the resulting predictions using membership-based weights. The authors present SHED as the first style-disentangled cross-modal alignment mechanism for this asymmetry and report state-of-the-art results across five domain generalization benchmarks, for example a 4.0% improvement over standard fine-tuning on DomainNet.
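The training-time homogenization described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the re-normalization after centroid removal, and the synthetic data are all assumptions made for illustration, and the real method operates on CLIP encoder outputs inside a fine-tuning loop.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def homogenize_image_embeddings(img_emb, domain_ids):
    """Subtract each source domain's centroid from that domain's image embeddings.

    img_emb: (N, D) image embeddings; domain_ids: (N,) integer domain labels.
    Re-normalizing afterwards is an assumption, not stated in the source.
    """
    out = np.empty_like(img_emb)
    centroids = {}
    for d in np.unique(domain_ids):
        mask = domain_ids == d
        centroids[int(d)] = img_emb[mask].mean(axis=0)
        out[mask] = img_emb[mask] - centroids[int(d)]
    return l2_normalize(out), centroids

def homogenize_text_embeddings(txt_emb):
    """Average over prompt templates, then subtract the global centroid.

    txt_emb: (T, C, D) for T templates, C classes, D dimensions.
    """
    class_emb = txt_emb.mean(axis=0)          # (C, D), one vector per class
    global_centroid = class_emb.mean(axis=0)  # (D,), shared across classes
    return l2_normalize(class_emb - global_centroid), global_centroid

# Synthetic stand-ins for CLIP embeddings (hypothetical data).
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 16))
domains = np.array([0, 0, 0, 0, 1, 1, 1, 1])
img[domains == 1] += 3.0                      # crude "style" offset for domain 1
txt = rng.normal(size=(5, 3, 16))             # 5 templates, 3 classes

img_h, cents = homogenize_image_embeddings(img, domains)
txt_h, global_centroid = homogenize_text_embeddings(txt)
logits = img_h @ txt_h.T                      # (8, 3) image-to-class similarities
```

In this toy setup the constant offset added to domain 1 is absorbed into that domain's centroid, so it no longer affects the similarities computed from the homogenized embeddings.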
📝 Abstract
Domain generalization aims to enhance model robustness against unseen domains with embedding distribution shifts. While large-scale vision-language models like CLIP exhibit strong generalization, their direct image-text embedding alignment suffers from inherent information asymmetry: images encode both class semantics and domain-specific styles, whereas text prompts primarily convey basic class cues. This asymmetry hinders generalization to novel domains in realistic scenarios. To address this, we propose Style-Homogenized Embedding alignment for Domain generalization (SHED), a novel CLIP-based method that aligns style-homogenized embeddings instead of the raw representations produced by the CLIP encoders. During training, SHED removes domain-specific style centroids from image embeddings, computed per source domain, and homogenizes text embeddings by averaging them across diverse prompt templates and subtracting a global centroid. At inference, where no target-domain information is available, SHED projects diverse textual domain centroids into the visual space and aggregates predictions via membership weighting. Extensive experiments on five benchmarks show that SHED achieves state-of-the-art performance, outperforming prior methods significantly (e.g., +4.0% on DomainNet vs. standard fine-tuning).
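The abstract does not specify how the projection or the membership weights are computed, so the following NumPy sketch of the inference step is only one plausible reading. The linear map `proj`, the cosine-similarity softmax used for membership, the temperature `tau`, and the function name `predict_with_membership` are all hypothetical choices, not details from the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def predict_with_membership(img_emb, text_domain_centroids, proj, class_emb, tau=0.1):
    """Fuse per-centroid predictions with membership weights (assumed scheme).

    img_emb: (N, D) test image embeddings
    text_domain_centroids: (K, Dt) centroids of K textual domain descriptions
    proj: (Dt, D) hypothetical linear map from text space to visual space
    class_emb: (C, D) homogenized, unit-norm class text embeddings
    """
    vis_centroids = text_domain_centroids @ proj                   # (K, D)
    # Membership: how strongly each image resembles each projected centroid.
    sims = l2_normalize(img_emb) @ l2_normalize(vis_centroids).T   # (N, K)
    weights = softmax(sims / tau, axis=1)                          # rows sum to 1
    # Remove each candidate style centroid, then classify.
    shifted = img_emb[:, None, :] - vis_centroids[None, :, :]      # (N, K, D)
    logits = l2_normalize(shifted) @ class_emb.T                   # (N, K, C)
    probs_per_centroid = softmax(logits / tau, axis=-1)
    # Weighted average of the K class distributions.
    return (weights[:, :, None] * probs_per_centroid).sum(axis=1)  # (N, C)

# Synthetic stand-ins for embeddings (hypothetical data).
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 16))
text_cents = rng.normal(size=(3, 16))
proj = rng.normal(size=(16, 16))
class_emb = l2_normalize(rng.normal(size=(5, 16)))
probs = predict_with_membership(img, text_cents, proj, class_emb)
```

Because the membership weights and each per-centroid class distribution sum to one, the fused output is itself a valid probability distribution over classes for every test image.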
Problem

Research questions and friction points this paper is trying to address.

domain generalization
embedding alignment
information asymmetry
vision-language models
style bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

domain generalization
embedding alignment
style homogenization
CLIP
vision-language models