AI Summary
Existing multimodal retrieval methods often suffer from text dominance and visual feature degradation caused by directly fusing textual and visual representations. To address this, we propose MIRe, a "fusion-free" dual-stream alignment framework that decouples modality processing. In MIRe, the text query dynamically attends to visual representations via an attention isolation mechanism, without back-propagating textual signals into the visual branch. We further introduce a question-to-paragraph pre-training dataset and a zero-shot cross-modal retrieval paradigm. Evaluated on four mainstream benchmarks, MIRe achieves substantial improvements in zero-shot retrieval performance, significantly enhancing visual information utilization and fine-grained query understanding while preserving modality-specific representation integrity.
Abstract
Recent multimodal retrieval methods have endowed text-based retrievers with multimodal capabilities by utilizing pre-training strategies for visual-text alignment. They often directly fuse the two modalities for cross-reference during the alignment to understand multimodal queries. However, existing methods often overlook crucial visual information due to a text-dominance issue, in which the alignment overly depends on text-driven signals. In this paper, we introduce MIRe, a retrieval framework that achieves modality interaction without fusing textual features during the alignment. Our method allows the textual query to attend to visual embeddings without feeding text-driven signals back into the visual representations. Additionally, we construct a pre-training dataset for multimodal query retrieval by transforming concise question-answer pairs into extended passages. Our experiments demonstrate that our pre-training strategy significantly enhances the understanding of multimodal queries, resulting in strong performance across four multimodal retrieval benchmarks under zero-shot settings. Our code is publicly available: https://github.com/yeongjoonJu/MIRe.
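The one-directional interaction described above can be sketched as a cross-attention in which the text query supplies the queries and visual embeddings supply only keys and values, so nothing is written back into the visual branch. This is a minimal illustrative sketch under assumed shapes and names, not MIRe's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def query_to_visual_attention(text_q, visual_kv):
    """One-directional cross-attention: text tokens attend to visual tokens.

    `text_q`: (n_text, d) text-query embeddings.
    `visual_kv`: (n_visual, d) visual embeddings, used only as keys/values;
    they are read but never updated, mirroring the fusion-free design.
    Both names and shapes are illustrative assumptions.
    """
    d = text_q.shape[-1]
    scores = text_q @ visual_kv.T / np.sqrt(d)   # (n_text, n_visual)
    attn = softmax(scores, axis=-1)              # each row sums to 1
    fused = attn @ visual_kv                     # visually enriched text tokens
    return fused, attn
```

In a trained setting one would additionally stop gradients (or freeze the vision encoder) so that text-driven signals cannot alter the visual representations.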