🤖 AI Summary
Problem: Existing image-text matching models generalize poorly to unseen target domains in cross-domain settings.
Method: This paper proposes an unsupervised domain-invariant representation learning framework. Its core innovation is the first coupling of semantic clustering with cross-modal graph-structure alignment: K-means produces fine-grained semantic clusters that expose transferable correspondences between image local regions and text tokens; a heterogeneous graph neural network (HGNN) then models the topological structure of the two modalities; finally, a contrastive graph-matching loss enforces cross-domain alignment of these graph structures.
Contribution/Results: The method requires no target-domain annotations and substantially improves out-of-domain generalization. On multi-domain transfer benchmarks, including Flickr30K and COCO, it achieves absolute R@1 gains of 6.2–9.8%, surpassing state-of-the-art domain-generalization methods for image-text matching.
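The clustering-based correspondence step can be illustrated with a minimal sketch. This is a hypothetical toy, not the paper's code: it runs plain K-means over a shared embedding space containing both image-region and text-token features, then treats a region and a token that fall in the same cluster as a transferable cross-modal correspondence. All names, the 2-D toy features, and the cluster count are assumptions for illustration.

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain K-means on tuples of floats; returns a cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # initialize centers from the data
    assign = [0] * len(points)
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: math.dist(p, centers[c]))
        # update step: each center moves to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(d) / len(members) for d in zip(*members)]
    return assign

# Toy shared embedding space (hypothetical 2-D features):
# two region/token groups that should form two semantic clusters.
regions = [(0.1, 0.0), (0.2, 0.1), (5.0, 5.1)]   # image local regions
tokens  = [(0.0, 0.2), (5.2, 4.9)]               # text tokens

labels = kmeans(regions + tokens, k=2)
region_labels = labels[:len(regions)]
token_labels = labels[len(regions):]

# Correspondence: region i matches token j when they share a cluster.
matches = [(i, j) for i, ri in enumerate(region_labels)
                  for j, tj in enumerate(token_labels) if ri == tj]
print(matches)  # e.g. [(0, 0), (1, 0), (2, 1)]
```

In the actual framework these correspondences would feed the HGNN, whose node features and cross-modal edges are then aligned across domains by the contrastive graph-matching loss; the sketch only covers the discovery of the correspondences themselves.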