🤖 AI Summary
Existing methods predominantly rely on task-specific fine-tuning and support only unimodal queries or documents, failing to handle multimodal (e.g., image–text) queries or to retrieve from multimodal document collections. To address this, we propose ReT-2—the first end-to-end model unifying multimodal query understanding and multimodal document retrieval. Its core innovation is the integration of an LSTM-inspired gated mechanism into the Transformer architecture, enabling recurrent, cross-layer, cross-modal fusion for dynamic, fine-grained semantic alignment and efficient information integration. Built upon vision-language pretrained backbones, ReT-2 jointly optimizes multimodal representation learning across layers with gated information aggregation. On benchmarks including M2KR and M-BEIR, ReT-2 achieves state-of-the-art performance with faster inference and a reduced memory footprint. Moreover, it significantly enhances retrieval-augmented generation, improving accuracy on both visual question answering and information-seeking tasks.
📝 Abstract
With the rapid advancement of multimodal retrieval and its application in LLMs and multimodal LLMs, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models and are limited to single-modality queries or documents. In this paper, we propose ReT-2, a unified retrieval model that supports multimodal queries, composed of both images and text, and searches across multimodal document collections where text and images coexist. ReT-2 leverages multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities, capturing fine-grained visual and textual details. We evaluate ReT-2 on the challenging M2KR and M-BEIR benchmarks across different retrieval configurations. Results demonstrate that ReT-2 consistently achieves state-of-the-art performance across diverse settings, while offering faster inference and reduced memory usage compared to prior approaches. When integrated into retrieval-augmented generation pipelines, ReT-2 also improves downstream performance on Encyclopedic-VQA and InfoSeek datasets. Our source code and trained models are publicly available at: https://github.com/aimagelab/ReT-2
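The LSTM-inspired gating described above can be illustrated with a minimal sketch: a recurrent state is carried across the backbone's layers, and at each layer, forget/input gates decide how much of the running state to keep and how much of that layer's features to admit. All names, dimensions, and weights below are hypothetical placeholders (random stand-ins for learned parameters and real backbone activations), not ReT-2's actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Hypothetical dimensions: 4 backbone layers, feature size 8.
num_layers, d = 4, 8

# Stand-ins for per-layer features from a vision-language backbone.
layer_feats = [rng.standard_normal(d) for _ in range(num_layers)]

# Hypothetical gate parameters (random placeholders, not learned weights).
W_f = rng.standard_normal((d, 2 * d)) * 0.1  # forget gate
W_i = rng.standard_normal((d, 2 * d)) * 0.1  # input gate
W_c = rng.standard_normal((d, 2 * d)) * 0.1  # candidate update

h = np.zeros(d)  # recurrent state carried across layers
for x in layer_feats:
    z = np.concatenate([h, x])
    f = sigmoid(W_f @ z)   # how much of the running state to keep
    i = sigmoid(W_i @ z)   # how much of this layer's features to admit
    c = np.tanh(W_c @ z)   # candidate update from this layer
    h = f * h + i * c      # LSTM-style gated aggregation

fused = h / (np.linalg.norm(h) + 1e-8)  # unit-norm retrieval embedding
print(fused.shape)  # (8,)
```

In a real retriever, `fused` would be compared against document embeddings via dot product; the gates let the model weight shallow (fine-grained visual) and deep (semantic) layers per input rather than using only the last layer.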