🤖 AI Summary
Existing methods predominantly rely on task-specific fine-tuning and support only unimodal queries or documents, failing to handle multimodal (e.g., image–text) queries or to retrieve from multimodal document collections. To address this, we propose ReT-2—the first end-to-end model unifying multimodal query understanding and multimodal document retrieval. Its core innovation is the integration of an LSTM-inspired gated mechanism into the Transformer architecture, enabling recurrent, cross-layer, cross-modal fusion for dynamic, fine-grained semantic alignment and efficient information integration. Built upon vision-language pretrained backbones, ReT-2 jointly optimizes multimodal representation learning across layers with gated information aggregation. On benchmarks including M2KR and M-BEIR, ReT-2 achieves state-of-the-art performance with faster inference and a reduced memory footprint. Moreover, it significantly enhances retrieval-augmented generation, improving accuracy on both visual question answering and information-seeking tasks.
📝 Abstract
With the rapid advancement of multimodal retrieval and its application in LLMs and multimodal LLMs, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models and are limited to single-modality queries or documents. In this paper, we propose ReT-2, a unified retrieval model that supports multimodal queries, composed of both images and text, and searches across multimodal document collections where text and images coexist. ReT-2 leverages multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities, capturing fine-grained visual and textual details. We evaluate ReT-2 on the challenging M2KR and M-BEIR benchmarks across different retrieval configurations. Results demonstrate that ReT-2 consistently achieves state-of-the-art performance across diverse settings, while offering faster inference and reduced memory usage compared to prior approaches. When integrated into retrieval-augmented generation pipelines, ReT-2 also improves downstream performance on Encyclopedic-VQA and InfoSeek datasets. Our source code and trained models are publicly available at: https://github.com/aimagelab/ReT-2
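The LSTM-inspired gating described above can be illustrated with a minimal sketch: a recurrent state is carried across the backbone's layers, and at each layer, forget/input gates decide how much of the running state to keep and how much of that layer's features to admit. All names, dimensions, and weights below are hypothetical placeholders (random stand-ins for learned parameters and real backbone activations), not ReT-2's actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Hypothetical dimensions: 4 backbone layers, feature size 8.
num_layers, d = 4, 8

# Stand-ins for per-layer features from a vision-language backbone.
layer_feats = [rng.standard_normal(d) for _ in range(num_layers)]

# Hypothetical gate parameters (random placeholders, not learned weights).
W_f = rng.standard_normal((d, 2 * d)) * 0.1  # forget gate
W_i = rng.standard_normal((d, 2 * d)) * 0.1  # input gate
W_c = rng.standard_normal((d, 2 * d)) * 0.1  # candidate update

h = np.zeros(d)  # recurrent state carried across layers
for x in layer_feats:
    z = np.concatenate([h, x])
    f = sigmoid(W_f @ z)   # how much of the running state to keep
    i = sigmoid(W_i @ z)   # how much of this layer's features to admit
    c = np.tanh(W_c @ z)   # candidate update from this layer
    h = f * h + i * c      # LSTM-style gated aggregation

fused = h / (np.linalg.norm(h) + 1e-8)  # unit-norm retrieval embedding
print(fused.shape)  # (8,)
```

In a real retriever, `fused` would be compared against document embeddings via dot product; the gates let the model weight shallow (fine-grained visual) and deep (semantic) layers per input rather than using only the last layer.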