Recurrence Meets Transformers for Universal Multimodal Retrieval

📅 2025-09-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods predominantly rely on task-specific fine-tuning and support only unimodal queries or documents, failing to handle multimodal (e.g., image-text) queries or to retrieve from multimodal document collections. To address this, the authors propose ReT-2, presented as the first end-to-end model unifying multimodal query understanding and multimodal document retrieval. Its core innovation is the integration of an LSTM-inspired gating mechanism into the Transformer architecture, enabling recurrent, cross-layer, cross-modal fusion for dynamic, fine-grained semantic alignment and efficient information integration. Built on vision-language pretraining foundations, ReT-2 jointly optimizes multimodal representation learning across layers with gated information aggregation. On the M2KR and M-BEIR benchmarks, ReT-2 achieves state-of-the-art performance with faster inference and a reduced memory footprint. It also significantly enhances retrieval-augmented generation, improving accuracy on both visual question answering and information-seeking tasks.

📝 Abstract
With the rapid advancement of multimodal retrieval and its application in LLMs and multimodal LLMs, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models and are limited to single-modality queries or documents. In this paper, we propose ReT-2, a unified retrieval model that supports multimodal queries, composed of both images and text, and searches across multimodal document collections where text and images coexist. ReT-2 leverages multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities, capturing fine-grained visual and textual details. We evaluate ReT-2 on the challenging M2KR and M-BEIR benchmarks across different retrieval configurations. Results demonstrate that ReT-2 consistently achieves state-of-the-art performance across diverse settings, while offering faster inference and reduced memory usage compared to prior approaches. When integrated into retrieval-augmented generation pipelines, ReT-2 also improves downstream performance on Encyclopedic-VQA and InfoSeek datasets. Our source code and trained models are publicly available at: https://github.com/aimagelab/ReT-2
Problem

Research questions and friction points this paper is trying to address.

Universal multimodal retrieval with a single unified model
Dynamic integration of cross-modal information across Transformer layers
Improving downstream performance in retrieval-augmented generation pipelines
Innovation

Methods, ideas, or system contributions that make the work stand out.

Recurrent Transformer architecture with LSTM-inspired gating mechanisms
Support for multimodal queries and multimodal document collections
Dynamic integration of information across layers and modalities
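The LSTM-inspired, cross-layer gated fusion described above can be sketched in heavily simplified form. This is an illustrative assumption-laden toy, not ReT-2's actual architecture: the dimension `d`, the weight matrices `W_g`/`W_c`, and the single-vector-per-layer formulation are all hypothetical, whereas the real model operates on full token sequences inside a Transformer. The sketch only shows the core idea of a recurrent state that a sigmoid gate updates as it sweeps over per-layer representations:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8           # hypothetical feature dimension
num_layers = 4  # number of Transformer layers to fuse

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical gate parameters, randomly initialized for illustration only.
W_g = rng.normal(scale=0.1, size=(d, 2 * d))  # gate projection
W_c = rng.normal(scale=0.1, size=(d, 2 * d))  # candidate projection

def gated_layer_fusion(layer_feats):
    """Recurrently fuse per-layer features with an LSTM-style gate.

    layer_feats: list of (d,) vectors, one per Transformer layer.
    Returns the final fused (d,) representation.
    """
    state = np.zeros(d)
    for h in layer_feats:
        z = np.concatenate([h, state])     # current layer + running state
        gate = sigmoid(W_g @ z)            # how much new information to admit
        cand = np.tanh(W_c @ z)            # candidate update from this layer
        state = gate * cand + (1.0 - gate) * state  # convex blend, elementwise
    return state

# Fake per-layer features standing in for one token's multi-layer outputs.
feats = [rng.normal(size=d) for _ in range(num_layers)]
fused = gated_layer_fusion(feats)
print(fused.shape)  # (8,)
```

Because the update is a convex combination of a tanh candidate and the previous state, the fused representation stays bounded regardless of how many layers are swept, which is one plausible reason such gating keeps cross-layer aggregation stable.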