🤖 AI Summary
Existing multimodal sentence embedding models rely on image-caption pairs for training, yet such pairs often contain visual or textual noise that degrades semantic alignment. To address this, the authors propose MCSEO, which incorporates fine-grained object-phrase alignment into vision-language contrastive learning: segmentation and object detection models localize objects in images and match them with their corresponding descriptive phrases, establishing precise cross-modal local alignments, and a contrastive loss tailored to this alignment structure optimizes the embeddings. MCSEO preserves the original backbone architecture and is plug-and-play. On standard semantic textual similarity (STS) benchmarks, MCSEO consistently outperforms strong baselines across mainstream backbones, including CLIP and ALPRO, demonstrating that fine-grained object-phrase alignment improves the robustness and discriminative power of multimodal sentence embeddings.
📝 Abstract
Multimodal sentence embedding models typically leverage image-caption pairs in addition to textual data during training. However, such pairs often contain noise, including redundant or irrelevant information on either the image or caption side. To mitigate this issue, we propose MCSEO, a method that enhances multimodal sentence embeddings by incorporating fine-grained object-phrase alignment alongside traditional image-caption alignment. Specifically, MCSEO utilizes existing segmentation and object detection models to extract accurate object-phrase pairs, which are then used to optimize a contrastive learning objective tailored to object-phrase correspondence. Experimental results on semantic textual similarity (STS) tasks across different backbone models demonstrate that MCSEO consistently outperforms strong baselines, highlighting the significance of precise object-phrase alignment in multimodal representation learning.
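The object-phrase contrastive objective described above can be sketched as follows. This is a minimal illustration assuming a standard symmetric InfoNCE formulation over matched object and phrase embeddings; the paper's actual loss, temperature, and embedding pipeline may differ.

```python
import numpy as np

def object_phrase_info_nce(obj_emb, phr_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss over matched pairs.

    obj_emb, phr_emb: (N, D) arrays; row i of each is an aligned
    object-region / phrase embedding pair. Off-diagonal rows serve
    as in-batch negatives. (Illustrative sketch, not the paper's code.)
    """
    # L2-normalize so dot products are cosine similarities
    obj = obj_emb / np.linalg.norm(obj_emb, axis=1, keepdims=True)
    phr = phr_emb / np.linalg.norm(phr_emb, axis=1, keepdims=True)
    logits = obj @ phr.T / temperature  # (N, N); diagonal = positives
    idx = np.arange(len(logits))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # average of object->phrase and phrase->object directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With well-aligned pairs the diagonal similarities dominate and the loss approaches zero; misaligned (e.g., shuffled) pairs yield a higher loss, which is what drives the embeddings toward precise object-phrase correspondence.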