Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval

πŸ“… 2025-03-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses multimodal document retrieval over collections of interleaved text and images. The authors propose ReT, a Transformer-based retrieval model built around a recurrent mechanism: a sigmoid-gated cross-modal recurrent cell that enables dynamic fusion and iterative comprehension of image and text queries across hierarchical vision-language representations. The method combines multi-level feature extraction, cross-modal Transformer interaction, and end-to-end joint training. Evaluated on the M2KR and M-BEIR benchmarks, ReT achieves state-of-the-art performance across diverse settings, outperforming existing approaches. To foster reproducibility and further research, the authors release both the source code and pre-trained models.

πŸ“ Abstract
Cross-modal retrieval is gaining increasing efficacy and interest from the research community, thanks to large-scale training, novel architectural and learning designs, and its application in LLMs and multimodal LLMs. In this paper, we move a step forward and design an approach that allows for multimodal queries, composed of both an image and a text, and can search within collections of multimodal documents, where images and text are interleaved. Our model, ReT, employs multi-level representations extracted from different layers of both visual and textual backbones, both at the query and document side. To allow for multi-level and cross-modal understanding and feature extraction, ReT employs a novel Transformer-based recurrent cell that integrates both textual and visual features at different layers, and leverages sigmoidal gates inspired by the classical design of LSTMs. Extensive experiments on M2KR and M-BEIR benchmarks show that ReT achieves state-of-the-art performance across diverse settings. Our source code and trained models are publicly available at https://github.com/aimagelab/ReT.
Problem

Research questions and friction points this paper is trying to address.

How to support multimodal queries, composed of both an image and a text, over document collections where text and images are interleaved.
How to fuse visual and textual features extracted from multiple backbone layers into a single cross-modal representation.
How to close the performance gap with existing retrieval approaches on the M2KR and M-BEIR benchmarks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal queries combining image and text
Transformer-based recurrent cell for feature integration
Sigmoidal gates inspired by LSTM designs
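The LSTM-inspired gating idea above can be sketched as follows. This is a minimal illustrative sketch, not the authors' actual ReT implementation: the class name `GatedFusionCell`, the weight shapes, and the gate layout are assumptions chosen to show how sigmoid gates can blend a recurrent state with per-layer visual and textual features.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GatedFusionCell:
    """Hypothetical sigmoid-gated recurrent cell (illustrative only).

    Fuses visual and textual features layer by layer, keeping a running
    state, in the spirit of the LSTM-style gating described in the paper.
    """
    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Each gate reads the concatenation [state; visual; textual].
        self.W_f = rng.standard_normal((dim, 3 * dim)) * 0.02  # forget gate
        self.W_i = rng.standard_normal((dim, 3 * dim)) * 0.02  # input gate
        self.W_c = rng.standard_normal((dim, 3 * dim)) * 0.02  # candidate

    def step(self, state, visual, textual):
        z = np.concatenate([state, visual, textual])
        f = sigmoid(self.W_f @ z)   # how much of the past state to keep
        i = sigmoid(self.W_i @ z)   # how much new cross-modal input to admit
        c = np.tanh(self.W_c @ z)   # candidate fused features
        return f * state + i * c

def fuse_layers(cell, visual_layers, textual_layers):
    """Iterate over multi-level backbone features, updating the fused state."""
    state = np.zeros_like(visual_layers[0])
    for v, t in zip(visual_layers, textual_layers):
        state = cell.step(state, v, t)
    return state
```

Running the cell over, say, three layers of 8-dimensional features yields a single fused 8-dimensional representation; in ReT the analogous output would feed the retrieval head on both the query and document sides.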