MIRe: Enhancing Multimodal Queries Representation via Fusion-Free Modality Interaction for Multimodal Retrieval

📅 2024-11-13
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing multimodal retrieval methods often suffer from text dominance and degraded visual features because textual and visual representations are fused directly during visual-text alignment. To address this, we propose MIRe, a retrieval framework that achieves modality interaction without fusing textual features during alignment: the textual query attends to visual embeddings, while no text-driven signal is fed back into the visual representations. We further construct a pre-training dataset for multimodal query retrieval by expanding concise question-answer pairs into extended passages. Under zero-shot settings, MIRe achieves strong performance across four multimodal retrieval benchmarks, indicating better use of visual information and a more fine-grained understanding of multimodal queries.
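The summary above specifies a one-directional interaction: the textual query attends to visual embeddings, and the visual branch receives no text-driven signal. Below is a minimal PyTorch sketch of that idea; the module, dimensions, and tensor names are illustrative assumptions, not the authors' actual implementation, which may differ in architecture and training details.

```python
# Illustrative sketch (not the authors' code): a single cross-attention block in
# which the textual query attends to visual embeddings, while the visual branch
# receives no text-driven signal (visual features are detached, so no gradients
# flow back into the visual encoder).
import torch
import torch.nn as nn

class FusionFreeInteraction(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_emb: torch.Tensor, visual_emb: torch.Tensor) -> torch.Tensor:
        # text_emb:   (batch, num_tokens, dim)   -- query side
        # visual_emb: (batch, num_patches, dim)  -- key/value side
        # Detaching keeps textual signals from being folded back into the
        # visual representations.
        v = visual_emb.detach()
        attended, _ = self.attn(query=text_emb, key=v, value=v)
        # The text stream is enriched with visual context; the visual stream is untouched.
        return self.norm(text_emb + attended)

if __name__ == "__main__":
    text = torch.randn(2, 16, 768)    # stand-in for text-encoder token embeddings
    image = torch.randn(2, 49, 768)   # stand-in for ViT patch embeddings
    block = FusionFreeInteraction()
    query_repr = block(text, image)   # (2, 16, 768) multimodal query representation
    print(query_repr.shape)
```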

📝 Abstract
Recent multimodal retrieval methods have endowed text-based retrievers with multimodal capabilities by utilizing pre-training strategies for visual-text alignment. They often directly fuse the two modalities for cross-reference during the alignment to understand multimodal queries. However, existing methods often overlook crucial visual information due to a text-dominant issue, in which they overly depend on text-driven signals. In this paper, we introduce MIRe, a retrieval framework that achieves modality interaction without fusing textual features during the alignment. Our method allows the textual query to attend to visual embeddings while not feeding text-driven signals back into the visual representations. Additionally, we construct a pre-training dataset for multimodal query retrieval by transforming concise question-answer pairs into extended passages. Our experiments demonstrate that our pre-training strategy significantly enhances the understanding of multimodal queries, resulting in strong performance across four multimodal retrieval benchmarks under zero-shot settings. Our code is publicly available: https://github.com/yeongjoonJu/MIRe.
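The abstract states only that concise question-answer pairs are transformed into extended passages for pre-training; the exact procedure is not described here. The snippet below is a deliberately naive, hypothetical illustration of that kind of expansion (a simple template), not the dataset construction actually used in the paper.

```python
# Hypothetical sketch only: the paper's QA-to-passage expansion is not detailed
# in this summary; a naive template stands in for it here.
from typing import List, Tuple

def qa_to_passage(question: str, answer: str) -> str:
    """Turn a concise QA pair into a short declarative passage (toy template)."""
    q = question.rstrip("?").strip()
    return f"{q}: {answer}."

def build_pretraining_records(qa_pairs: List[Tuple[str, str]]) -> List[dict]:
    # Each record pairs a query with its expanded passage; during pre-training
    # the query side would additionally carry the associated image.
    return [{"query": q, "passage": qa_to_passage(q, a)} for q, a in qa_pairs]

if __name__ == "__main__":
    pairs = [("What animal is shown in the image", "a golden retriever")]
    print(build_pretraining_records(pairs))
```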
Problem

Research questions and friction points this paper is trying to address.

Direct fusion of textual and visual representations during alignment
Text-dominant signals that overlook crucial visual information
Limited zero-shot performance on multimodal query retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fusion-free modality interaction for multimodal query representation
Visual-text alignment without feeding text-driven signals into visual representations
Pre-training dataset built by expanding question-answer pairs into extended passages