Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval

📅 2026-04-28

🤖 AI Summary

This work addresses a limitation of existing large multimodal models (LMMs) in cross-modal retrieval: they often overlook subject-level semantics, which leads to visual neglect and semantic drift and prevents precise alignment between key image regions and textual descriptions. To overcome this, the authors propose the SSA-ME framework, which introduces what is described as the first subject-level saliency modeling mechanism. The mechanism uses saliency-aware guidance to direct cross-modal attention toward the semantically central regions of an image, and it adds a feature regeneration module that recalibrates visual features so that the two modalities are fused in a balanced, semantically consistent way. On the MMEB benchmark the method achieves state-of-the-art performance, and qualitative analyses illustrate its interpretability.
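To make the idea of saliency-aware guidance concrete, here is a minimal sketch of one way an attention distribution over image patches could be pulled toward a saliency map. The summary does not give the form of the objective, so the KL divergence used here, the function names `normalize` and `saliency_alignment_loss`, and the 4-patch toy data are all assumptions for illustration, not the authors' method.

```python
import math

def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        # No salient region marked: fall back to a uniform target.
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]

def saliency_alignment_loss(attention, saliency, eps=1e-8):
    """KL(saliency || attention) over image patches (assumed loss form)."""
    target = normalize(saliency)
    attn = normalize(attention)
    return sum(
        t * math.log((t + eps) / (a + eps))
        for t, a in zip(target, attn)
        if t > 0  # patches with zero target mass contribute nothing
    )

# Hypothetical 4-patch image: the subject occupies patches 1 and 2.
saliency = [0.0, 1.0, 1.0, 0.0]
focused = [0.05, 0.45, 0.45, 0.05]   # attention mostly on the subject
diffuse = [0.25, 0.25, 0.25, 0.25]   # attention spread evenly

loss_focused = saliency_alignment_loss(focused, saliency)
loss_diffuse = saliency_alignment_loss(diffuse, saliency)
```

Under this assumed loss, attention concentrated on the salient patches is penalized less than attention spread evenly, which is the behavior a saliency-guided objective is meant to encourage.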

📝 Abstract

Despite significant progress in Unified Multimodal Retrieval (UMR) powered by Large Multimodal Models (LMMs), existing embedding methods primarily focus on sample-level objectives via contrastive learning while overlooking the crucial subject-level semantics. This limitation hinders the model's ability to group semantically coherent subjects in complex multimodal queries, manifesting as semantic alignment deviation, where models fail to accurately localize salient text-referred regions in visual content. Moreover, without explicit guidance to model salient visual subjects, LMMs tend to over-rely on textual cues, resulting in visual modality neglect and suboptimal utilization of visual knowledge. To this end, we propose Salient Subject-Aware Multimodal Embedding (SSA-ME), a novel framework designed to enhance fine-grained representation learning through saliency-aware modeling. SSA-ME leverages LMMs and visual experts to identify and emphasize salient visual concepts in image-text pairs, and introduces a saliency-guided objective to better align cross-modal attention with semantically meaningful regions. Additionally, a feature regeneration module recalibrates visual features based on the derived saliency maps, ensuring a balanced and semantically coherent integration across modalities. Extensive experiments show that our method achieves state-of-the-art performance on the MMEB benchmark, demonstrating that incorporating subject-level modeling substantially improves multimodal retrieval. Comprehensive qualitative analyses further illustrate the interpretability and effectiveness of our approach.
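The recalibration step can be pictured with a small sketch. The abstract does not describe the internals of the feature regeneration module, so the weighting rule `1 + alpha * s_i`, the function name `recalibrate`, the parameter `alpha`, and the toy patch features below are hypothetical choices made only to show how a saliency map could shift the pooled visual representation toward the salient subject.

```python
def recalibrate(patch_features, saliency, alpha=1.0):
    """Reweight patch features by saliency, then pool them into one vector.

    Each patch i gets weight (1 + alpha * s_i); alpha = 0 reduces to plain
    mean pooling. This rule is an illustrative assumption.
    """
    weights = [1.0 + alpha * s for s in saliency]
    total = sum(weights)
    dim = len(patch_features[0])
    pooled = [0.0] * dim
    for w, feat in zip(weights, patch_features):
        for d in range(dim):
            pooled[d] += (w / total) * feat[d]
    return pooled

# Hypothetical features: dimension 0 encodes background, dimension 1 the subject.
patches = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
saliency = [0.0, 1.0, 1.0, 0.0]

plain = recalibrate(patches, saliency, alpha=0.0)    # ordinary mean pooling
boosted = recalibrate(patches, saliency, alpha=3.0)  # salient patches upweighted
```

In this toy setting, raising `alpha` increases the share of the pooled vector that comes from the salient patches, which is one simple way to counter the tendency to underuse visual information.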

Problem

Research questions and friction points this paper is trying to address.

Visual Neglect

Semantic Drift

Cross-Modal Retrieval

Subject-Level Semantics

Multimodal Embedding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Salient Subject-Aware

Multimodal Embedding

Cross-Modal Retrieval