CrossFlowDG: Bridging the Modality Gap with Cross-modal Flow Matching for Domain Generalization

📅 2026-04-18
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses two challenges in domain generalization: models' overreliance on domain-specific stylistic cues at the expense of class semantics, and the residual modality gap that persists in vision–language approaches built on cosine-similarity contrastive alignment. The authors propose noise-free, cross-modal flow matching in a joint Euclidean latent space, learning a continuous transformation that explicitly transports domain-biased image embeddings toward the domain-invariant text embeddings of the correct class. The aim is to reduce the geometric separation that remains between image and text embeddings even when they correspond semantically. Built on a VMamba image encoder and CLIP's text encoder, the method is evaluated on four common domain generalization benchmarks, where it reports competitive performance on several and state-of-the-art results on TerraIncognita.
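
To make the modality gap concrete, the following toy sketch builds hand-made synthetic embeddings (not VMamba or CLIP outputs) in which every image is correctly matched to its text by cosine similarity, yet the two modalities still occupy separate regions of the space. The helper names `retrieval_accuracy` and `centroid_gap`, the centroid-distance measure, and the shared per-modality offset are illustrative assumptions, not definitions taken from the paper.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length, as cosine similarity does implicitly."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def retrieval_accuracy(image_emb, text_emb):
    """Fraction of images whose most cosine-similar text is the matching one."""
    sims = l2_normalize(image_emb) @ l2_normalize(text_emb).T
    return float((sims.argmax(axis=1) == np.arange(len(image_emb))).mean())

def centroid_gap(image_emb, text_emb):
    """Euclidean distance between the two modality centroids on the unit sphere."""
    diff = l2_normalize(image_emb).mean(axis=0) - l2_normalize(text_emb).mean(axis=0)
    return float(np.linalg.norm(diff))

# Synthetic data (assumption): one orthogonal direction per class, plus a
# shared offset along the first axis whose sign depends on the modality.
n_classes = 4
class_part = 2.0 * np.eye(n_classes)
offset = np.ones((n_classes, 1))
image_emb = np.hstack([+offset, class_part])  # images sit at +1 on the first axis
text_emb = np.hstack([-offset, class_part])   # texts sit at -1 on the first axis

acc = retrieval_accuracy(image_emb, text_emb)
gap = centroid_gap(image_emb, text_emb)
print(f"retrieval accuracy: {acc:.2f}, centroid gap: {gap:.3f}")
# prints "retrieval accuracy: 1.00, centroid gap: 0.894"
```

In this toy setup the matching is perfect while the centroids stay roughly 0.89 apart, which is the kind of residual separation a transport step in Euclidean space is meant to reduce.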

📝 Abstract
Domain generalization (DG) aims to maintain performance under domain shift, which in computer vision appears primarily as stylistic variation that causes models to overfit to domain-specific appearance cues rather than class semantics. To overcome this, recent methods use textual representations as stable, domain-invariant anchors. However, multimodal approaches that rely on cosine-similarity-based contrastive alignment leave a modality gap: image and text embeddings remain geometrically separated despite their semantic correspondence. We propose CrossFlowDG, a novel DG framework that addresses this residual gap using noise-free, cross-modal flow matching. By learning a continuous transformation in the joint Euclidean latent space, our framework explicitly transports domain-biased image embeddings toward domain-invariant text embeddings of the correct class. Using the efficient VMamba image encoder and CLIP's text encoder, CrossFlowDG is evaluated on four common DG benchmarks, achieving competitive performance on several and state-of-the-art results on TerraIncognita. Code is available at: https://github.com/ajkrit/CrossFlowDG
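
As a rough illustration of noise-free, cross-modal flow matching, the sketch below uses image embeddings as the source distribution (instead of Gaussian noise) and class text embeddings as the target. It assumes the common straight-line path z_t = (1 - t) z_img + t z_txt with target velocity z_txt - z_img; the abstract does not specify the path, the velocity network, or the loss, so those choices are assumptions. A linear velocity model fitted by least squares stands in for a learned network, all data are synthetic, and the names `fit_velocity`, `transport`, `gap`, and `style` are hypothetical. This is not the released CrossFlowDG code.

```python
import numpy as np

def interpolate(z_img, z_txt, t):
    """Noise-free straight-line path from image to text embedding."""
    return (1.0 - t) * z_img + t * z_txt

def features(z, t):
    """Inputs of the toy linear velocity model: [z, t, 1]."""
    return np.concatenate([z, t, np.ones_like(t)], axis=1)

def fit_velocity(z_img, z_txt, rng, n_times=8):
    """Least-squares fit of v(z_t, t) to the target velocity z_txt - z_img."""
    xs, ys = [], []
    for _ in range(n_times):
        t = rng.uniform(size=(len(z_img), 1))  # one random time per pair
        xs.append(features(interpolate(z_img, z_txt, t), t))
        ys.append(z_txt - z_img)
    W, *_ = np.linalg.lstsq(np.vstack(xs), np.vstack(ys), rcond=None)
    return W

def transport(z_img, W, n_steps=10):
    """Euler integration of dz/dt = v(z, t) from t = 0 to t = 1."""
    z, dt = z_img.copy(), 1.0 / n_steps
    for k in range(n_steps):
        t = np.full((len(z), 1), k * dt)
        z = z + dt * (features(z, t) @ W)
    return z

# Synthetic setup (assumption): text anchors per class, a shared modality
# offset, a per-domain style shift, and small per-sample noise.
rng = np.random.default_rng(0)
dim, n_classes, n_domains, n = 8, 5, 3, 600
text_anchors = rng.normal(size=(n_classes, dim))
gap = np.ones(dim)
style = 0.3 * rng.normal(size=(n_domains, dim))
labels = rng.integers(n_classes, size=n)
domains = rng.integers(n_domains, size=n)
z_txt = text_anchors[labels]
z_img = z_txt + gap + style[domains] + 0.1 * rng.normal(size=(n, dim))

W = fit_velocity(z_img, z_txt, rng)
z_out = transport(z_img, W)
dist_before = np.linalg.norm(z_img - z_txt, axis=1).mean()
dist_after = np.linalg.norm(z_out - z_txt, axis=1).mean()
print(f"mean distance to class text anchor: {dist_before:.3f} -> {dist_after:.3f}")
```

The point of the design is that no noise is sampled: each training pair is an image embedding and the text embedding of its class, so the learned field moves embeddings directly across the gap. A nonlinear network would be needed to model class- and domain-dependent velocities that this linear stand-in cannot capture.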
Problem

Research questions and friction points this paper is trying to address.

domain generalization
modality gap
cross-modal alignment
embedding discrepancy
semantic correspondence
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-modal flow matching
domain generalization
modality gap
embedding transport
multimodal alignment