QA-Dragon: Query-Aware Dynamic RAG System for Knowledge-Intensive Visual Question Answering

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional RAG approaches for knowledge-intensive visual question answering (VQA) suffer from unimodal retrieval limitations and inadequate support for multi-hop and cross-modal reasoning. To address this, we propose a query-aware dynamic RAG system featuring two novel routing mechanisms: domain routing, which identifies the query's thematic domain, and search routing, which dynamically orchestrates text/image retrieval agents and multi-turn interaction strategies to enable multimodal coordination, multi-hop inference, and hybrid retrieval. Built upon a multimodal large language model (MLLM), our end-to-end framework achieves significant improvements in the KDD Cup 2025 Meta CRAG-MM Challenge: +5.06% accuracy on single-source, +6.35% on multi-source, and +5.03% on multi-turn tasks, alongside enhanced knowledge coverage. These results demonstrate that the system effectively overcomes the unimodal retrieval bottleneck of prior RAG methods.

📝 Abstract
Retrieval-Augmented Generation (RAG) has been introduced to mitigate hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge into the generation process, and it has become a widely adopted approach for knowledge-intensive Visual Question Answering (VQA). However, existing RAG methods typically retrieve from either text or images in isolation, limiting their ability to address complex queries that require multi-hop reasoning or up-to-date factual knowledge. To address this limitation, we propose QA-Dragon, a Query-Aware Dynamic RAG System for Knowledge-Intensive VQA. Specifically, QA-Dragon introduces a domain router to identify the query's subject domain for domain-specific reasoning, along with a search router that dynamically selects optimal retrieval strategies. By orchestrating both text and image search agents in a hybrid setup, our system supports multimodal, multi-turn, and multi-hop reasoning, enabling it to tackle complex VQA tasks effectively. We evaluate our QA-Dragon on the Meta CRAG-MM Challenge at KDD Cup 2025, where it significantly enhances the reasoning performance of base models under challenging scenarios. Our framework achieves substantial improvements in both answer accuracy and knowledge overlap scores, outperforming baselines by 5.06% on the single-source task, 6.35% on the multi-source task, and 5.03% on the multi-turn task.
Problem

Research questions and friction points this paper is trying to address.

Enhances VQA by integrating text and image retrieval dynamically
Addresses complex queries needing multi-hop and up-to-date knowledge
Improves reasoning accuracy in multimodal, multi-turn VQA tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic domain router for query-specific reasoning
Hybrid text and image search agents
Multimodal multi-hop reasoning system
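The two-stage routing described above can be sketched in Python. This is a minimal, hypothetical illustration, not the paper's implementation: QA-Dragon's routers are MLLM-based, whereas here `domain_router` and `search_router` are stubbed with keyword heuristics, and all names (`DOMAIN_KEYWORDS`, `answer`, the agent labels) are invented for this sketch.

```python
# Hypothetical sketch of QA-Dragon's two-stage routing. The real system
# uses MLLM-based routers; keyword heuristics stand in for them here.

DOMAIN_KEYWORDS = {
    "shopping": ["price", "buy", "cost"],
    "sports": ["score", "team", "match"],
}

def domain_router(query: str) -> str:
    """Stage 1: identify the query's subject domain for domain-specific reasoning."""
    q = query.lower()
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if any(kw in q for kw in keywords):
            return domain
    return "general"

def search_router(query: str, has_image: bool) -> list[str]:
    """Stage 2: dynamically select retrieval agents (text, image, or both)."""
    agents = []
    if has_image:
        agents.append("image_search")  # ground the visual entity first
    if any(tok in query.lower() for tok in ("latest", "current", "price", "score")):
        agents.append("text_search")  # up-to-date facts need text retrieval
    return agents or ["text_search"]

def answer(query: str, has_image: bool) -> dict:
    """Orchestrate both routers; in the full system, evidence from each
    agent would be passed to the MLLM for multi-hop, multi-turn reasoning."""
    return {
        "domain": domain_router(query),
        "agents": search_router(query, has_image),
    }

print(answer("What is the current price of this sneaker?", has_image=True))
```

A hybrid query about a photographed product would route to the shopping domain and invoke both image and text agents, while a purely textual follow-up turn would fall back to text retrieval alone.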
Authors
Zhuohang Jiang (The Hong Kong Polytechnic University): LLM, RAG, RecSys
Pangjing Wu (The Hong Kong Polytechnic University): Reinforcement Learning, Natural Language Processing, Data Mining
Xu Yuan (The Hong Kong Polytechnic University, Hong Kong SAR, China)
Wenqi Fan (The Hong Kong Polytechnic University, Hong Kong SAR, China)
Qing Li (The Hong Kong Polytechnic University, Hong Kong SAR, China)