🤖 AI Summary
This work addresses the limitation of existing large models in cross-modal retrieval, which often neglect subject-level semantics, leading to visual oversight and semantic drift that hinder precise alignment between key image regions and textual descriptions. To overcome this, the authors propose the SSA-ME framework, which introduces, for the first time, a subject-level saliency modeling mechanism. This mechanism employs saliency-aware guidance to direct cross-modal attention toward semantic cores and integrates a feature regeneration module to recalibrate visual features, thereby achieving balanced and semantically consistent fusion across modalities. Evaluated on the MMEB benchmark, the proposed method achieves state-of-the-art performance, significantly enhancing fine-grained retrieval accuracy while offering strong interpretability.
📝 Abstract
Despite significant progress in Unified Multimodal Retrieval (UMR) powered by Large Multimodal Models (LMMs), existing embedding methods primarily focus on sample-level objectives via contrastive learning while overlooking the crucial subject-level semantics. This limitation hinders the model's ability to group semantically coherent subjects in complex multimodal queries, manifesting as semantic alignment deviation--where models fail to accurately localize salient text-referred regions in visual content. Moreover, without explicit guidance to model salient visual subjects, LMMs tend to over-rely on textual cues, resulting in visual modality neglect and suboptimal utilization of visual knowledge. To this end, we propose Salient Subject-Aware Multimodal Embedding (SSA-ME), a novel framework designed to enhance fine-grained representation learning through saliency-aware modeling. SSA-ME leverages LMMs and visual experts to identify and emphasize salient visual concepts in image-text pairs, and introduces a saliency-guided objective to better align cross-modal attention with semantically meaningful regions. Additionally, a feature regeneration module recalibrates visual features based on the derived saliency maps, ensuring a balanced and semantically coherent integration across modalities. Extensive experiments show that our method achieves state-of-the-art performance on the MMEB benchmark, demonstrating that incorporating subject-level modeling substantially improves multimodal retrieval. Comprehensive qualitative analyses further illustrate the interpretability and effectiveness of our approach.