🤖 AI Summary
Current multimodal models (e.g., CLIP) suffer from three key limitations: (1) overreliance on image-text pairs while neglecting inter-sample semantic relationships; (2) exclusive reliance on global embedding matching, lacking fine-grained alignment within relation-aware subspaces; and (3) disregard for intra-modal consistency. To address these, we propose Relation-Conditioned Multimodal Learning (RCML), the first framework to explicitly model natural-language-described semantic relations as many-to-many training structures. RCML introduces a relation-conditioned cross-attention mechanism and jointly optimizes cross-modal and intra-modal contrastive objectives for fine-grained semantic alignment. It integrates relation-aware feature modulation, many-to-many semantic sampling, and joint global-local context modeling. Extensive experiments demonstrate that RCML significantly outperforms strong baselines, including CLIP, across multiple benchmarks, with consistent gains in both retrieval and classification tasks, validating the effectiveness and generalizability of relation-guided learning.
📝 Abstract
Multimodal representation learning has advanced rapidly with contrastive models such as CLIP, which align image-text pairs in a shared embedding space. However, these models face limitations: (1) they typically focus on image-text pairs, underutilizing the semantic relations across different pairs; (2) they directly match global embeddings without contextualization, overlooking the need for semantic alignment along specific subspaces or relational dimensions; and (3) they emphasize cross-modal contrast, with limited support for intra-modal consistency. To address these issues, we propose Relation-Conditioned Multimodal Learning (RCML), a framework that learns multimodal representations under natural-language relation descriptions, which guide both feature extraction and alignment. Our approach constructs many-to-many training pairs linked by semantic relations and introduces a relation-guided cross-attention mechanism that modulates multimodal representations under each relation context. The training objective combines inter-modal and intra-modal contrastive losses, encouraging consistency both across modalities and among semantically related samples. Experiments on multiple datasets show that RCML consistently outperforms strong baselines on both retrieval and classification tasks, highlighting the effectiveness of leveraging semantic relations to guide multimodal representation learning.
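To make the mechanism concrete, here is a minimal PyTorch sketch of the two ideas the abstract describes: a relation-guided cross-attention step that conditions modality features on a relation embedding, and an InfoNCE-style contrastive loss reused for both inter-modal and intra-modal terms. All module names, dimensions, and the loss weighting are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationGuidedCrossAttention(nn.Module):
    """Hypothetical sketch: pool token features under a relation context.

    The relation embedding acts as the attention query, so the output is a
    single embedding of the sample *as seen through* that relation.
    """
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor, relation: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, D) token features from an image or text encoder
        # relation: (B, D) embedding of the natural-language relation description
        query = relation.unsqueeze(1)               # relation as the query
        out, _ = self.attn(query, feats, feats)     # attend to features under the relation
        return self.norm(out.squeeze(1))            # (B, D) relation-conditioned embedding

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric-in-use InfoNCE: matched rows of a and b are positives."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

def rcml_objective(img, txt, img_pos, txt_pos, intra_weight: float = 0.5):
    """Combined objective (sketch): cross-modal term plus intra-modal terms,
    where img_pos/txt_pos are same-modality samples linked by the same relation."""
    inter = info_nce(img, txt) + info_nce(txt, img)
    intra = info_nce(img, img_pos) + info_nce(txt, txt_pos)
    return inter + intra_weight * intra
```

The sketch only shows how the pieces compose; the many-to-many pair construction (which samples count as `img_pos`/`txt_pos` under a given relation) is a dataset-level step not shown here.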