🤖 AI Summary
Existing multimodal sentence embedding models rely on image-caption pairs for training, yet such pairs often contain visual or textual noise that degrades semantic alignment. To address this, the authors propose MCSEO, which incorporates fine-grained object-phrase alignment into vision-language contrastive learning: segmentation and object detection models localize objects in images and match them with their corresponding descriptive phrases, establishing precise cross-modal local alignments, and a contrastive loss tailored to this alignment structure optimizes the embeddings. MCSEO preserves the original backbone architecture and is plug-and-play. On standard semantic textual similarity (STS) benchmarks, MCSEO consistently outperforms strong baselines across mainstream backbones, including CLIP and ALPRO, demonstrating that fine-grained object-phrase alignment improves the robustness and discriminative power of multimodal sentence embeddings.
📝 Abstract
Multimodal sentence embedding models typically leverage image-caption pairs in addition to textual data during training. However, such pairs often contain noise, including redundant or irrelevant information on either the image or caption side. To mitigate this issue, we propose MCSEO, a method that enhances multimodal sentence embeddings by incorporating fine-grained object-phrase alignment alongside traditional image-caption alignment. Specifically, MCSEO utilizes existing segmentation and object detection models to extract accurate object-phrase pairs, which are then used to optimize a contrastive learning objective tailored to object-phrase correspondence. Experimental results on semantic textual similarity (STS) tasks across different backbone models demonstrate that MCSEO consistently outperforms strong baselines, highlighting the significance of precise object-phrase alignment in multimodal representation learning.
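The object-phrase contrastive objective described above can be sketched as follows. This is a minimal illustration assuming a standard symmetric InfoNCE formulation over matched object and phrase embeddings; the paper's actual loss, temperature, and embedding pipeline may differ.

```python
import numpy as np

def object_phrase_info_nce(obj_emb, phr_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss over matched pairs.

    obj_emb, phr_emb: (N, D) arrays; row i of each is an aligned
    object-region / phrase embedding pair. Off-diagonal rows serve
    as in-batch negatives. (Illustrative sketch, not the paper's code.)
    """
    # L2-normalize so dot products are cosine similarities
    obj = obj_emb / np.linalg.norm(obj_emb, axis=1, keepdims=True)
    phr = phr_emb / np.linalg.norm(phr_emb, axis=1, keepdims=True)
    logits = obj @ phr.T / temperature  # (N, N); diagonal = positives
    idx = np.arange(len(logits))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # average of object->phrase and phrase->object directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With well-aligned pairs the diagonal similarities dominate and the loss approaches zero; misaligned (e.g., shuffled) pairs yield a higher loss, which is what drives the embeddings toward precise object-phrase correspondence.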