Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation

📅 2025-10-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing RAG systems operate predominantly on text-only documents and struggle in real-world scenarios where queries and documents contain both text and images. This work addresses Universal Retrieval-Augmented Generation (URAG), the setting in which a system retrieves and reasons over mixed-modal information, and introduces Nyx, a unified mixed-modal to mixed-modal retriever. An automated four-stage generation and filtering pipeline over web documents produces NyxQA, a dataset of mixed-modal question-answer pairs. Nyx is trained in two stages: (1) pre-training on NyxQA together with open-source retrieval datasets, followed by (2) supervised fine-tuning guided by feedback from downstream vision-language models (VLMs), which aligns retrieval outputs with generative preferences. Experiments show that Nyx remains competitive on standard text-only RAG benchmarks while significantly improving generation quality in the URAG setting.

📝 Abstract

Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus. However, existing RAG systems primarily focus on unimodal text documents, and often fall short in real-world scenarios where both queries and documents may contain mixed modalities (such as text and images). In this paper, we address the challenge of Universal Retrieval-Augmented Generation (URAG), which involves retrieving and reasoning over mixed-modal information to improve vision-language generation. To this end, we propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate the scarcity of realistic mixed-modal data, we introduce a four-stage automated pipeline for generation and filtering, leveraging web documents to construct NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that better reflect real-world information needs. Building on this high-quality dataset, we adopt a two-stage training framework for Nyx: we first perform pre-training on NyxQA along with a variety of open-source retrieval datasets, followed by supervised fine-tuning using feedback from downstream vision-language models (VLMs) to align retrieval outputs with generative preferences. Experimental results demonstrate that Nyx not only performs competitively on standard text-only RAG benchmarks, but also excels in the more general and realistic URAG setting, significantly improving generation quality in vision-language tasks.
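To make the retrieval step concrete, here is a minimal sketch of ranking mixed-modal documents against a query in a shared embedding space. It is an illustration under stated assumptions, not the Nyx implementation: the `MixedDoc` type, the `retrieve` helper, the cosine-similarity ranking, and the hand-written three-dimensional vectors are all hypothetical stand-ins for the output of a mixed-modal encoder.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class MixedDoc:
    """Hypothetical document record: text, an image reference, or both."""
    doc_id: str
    text: Optional[str]
    image_path: Optional[str]
    embedding: Sequence[float]  # assumed to come from a mixed-modal encoder


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_embedding: Sequence[float],
             corpus: List[MixedDoc],
             k: int = 2) -> List[Tuple[float, MixedDoc]]:
    """Return the k documents most similar to the query, best first."""
    scored = [(cosine(query_embedding, doc.embedding), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy corpus covering the three document shapes a mixed-modal corpus can hold.
corpus = [
    MixedDoc("text-only", "article without figures", None, [1.0, 0.0, 0.0]),
    MixedDoc("image-only", None, "figure.png", [0.0, 1.0, 0.0]),
    MixedDoc("text+image", "article with a figure", "photo.png", [0.7, 0.7, 0.0]),
]

# Placeholder embedding for a query that itself mixes text and an image.
query = [0.6, 0.8, 0.0]
top = retrieve(query, corpus, k=2)
top_ids = [doc.doc_id for _, doc in top]
```

In a URAG-style system the retrieved documents would then be passed, together with the query, to a vision-language model for generation; that step is omitted here because the abstract does not specify its interface.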

Problem

Research questions and friction points this paper is trying to address.

Addressing retrieval challenges in mixed-modal scenarios with text and images

Developing universal RAG systems for vision-language generation enhancement

Overcoming scarcity of realistic mixed-modal training data for retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified mixed-modal to mixed-modal retriever for URAG

Automated pipeline generating filtered mixed-modal QA dataset

Two-stage training with pretraining and VLM-feedback fine-tuning
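The VLM-feedback fine-tuning idea can be sketched as follows. This is one plausible reading, not the paper's stated procedure: it assumes each candidate document receives a scalar feedback score from a downstream VLM (for example, how good the generated answer is when that candidate is supplied as context), treats the best-scoring candidate as the positive, and applies an InfoNCE-style contrastive loss. The loss choice, the temperature value, and the names `pick_positive` and `info_nce` are assumptions introduced for illustration.

```python
import math
from typing import Sequence


def pick_positive(feedback_scores: Sequence[float]) -> int:
    """Index of the candidate the (hypothetical) VLM feedback prefers most."""
    return max(range(len(feedback_scores)), key=lambda i: feedback_scores[i])


def info_nce(similarities: Sequence[float],
             positive_index: int,
             temperature: float = 0.05) -> float:
    """InfoNCE loss: negative log softmax probability of the positive."""
    logits = [s / temperature for s in similarities]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denominator = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    return log_denominator - logits[positive_index]


# Retriever similarities for three candidates (placeholder values).
retriever_sims = [0.82, 0.75, 0.40]
# Hypothetical feedback: candidate 1 led to the best generated answer.
vlm_feedback = [0.10, 0.90, 0.05]

pos = pick_positive(vlm_feedback)
loss_before = info_nce(retriever_sims, pos)

# If training raises the preferred candidate's similarity, the loss drops.
loss_after = info_nce([0.60, 0.85, 0.40], pos)
```

The point of the sketch is the direction of the signal: the retriever initially ranks candidate 0 highest, the feedback prefers candidate 1, and minimizing the loss pushes the retriever's ranking toward the generator's preference.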

🔎 Similar Papers

UniRAG: Universal Retrieval Augmentation for Large Vision Language Models