QA-Dragon: Query-Aware Dynamic RAG System for Knowledge-Intensive Visual Question Answering

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional RAG approaches for knowledge-intensive visual question answering (VQA) suffer from unimodal retrieval limitations and inadequate support for multi-hop and cross-modal reasoning. To address this, we propose a query-aware dynamic RAG system featuring two novel routing mechanisms: domain routing, which identifies the query's thematic domain, and search routing, which dynamically orchestrates text/image retrieval agents and multi-turn interaction strategies to enable multimodal coordination, multi-hop inference, and hybrid retrieval. Built upon a multimodal large language model (MLLM), our end-to-end framework achieves significant improvements in the KDD Cup 2025 Meta CRAG-MM Challenge: +5.06% accuracy on single-source, +6.35% on multi-source, and +5.03% on multi-turn tasks, alongside enhanced knowledge coverage. These results demonstrate that the system effectively overcomes the unimodal retrieval bottleneck of prior RAG methods.

📝 Abstract
Retrieval-Augmented Generation (RAG) has been introduced to mitigate hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge into the generation process, and it has become a widely adopted approach for knowledge-intensive Visual Question Answering (VQA). However, existing RAG methods typically retrieve from either text or images in isolation, limiting their ability to address complex queries that require multi-hop reasoning or up-to-date factual knowledge. To address this limitation, we propose QA-Dragon, a Query-Aware Dynamic RAG System for Knowledge-Intensive VQA. Specifically, QA-Dragon introduces a domain router to identify the query's subject domain for domain-specific reasoning, along with a search router that dynamically selects optimal retrieval strategies. By orchestrating both text and image search agents in a hybrid setup, our system supports multimodal, multi-turn, and multi-hop reasoning, enabling it to tackle complex VQA tasks effectively. We evaluate our QA-Dragon on the Meta CRAG-MM Challenge at KDD Cup 2025, where it significantly enhances the reasoning performance of base models under challenging scenarios. Our framework achieves substantial improvements in both answer accuracy and knowledge overlap scores, outperforming baselines by 5.06% on the single-source task, 6.35% on the multi-source task, and 5.03% on the multi-turn task.
Problem

Research questions and friction points this paper is trying to address.

Enhances VQA by integrating text and image retrieval dynamically
Addresses complex queries needing multi-hop and up-to-date knowledge
Improves reasoning accuracy in multimodal, multi-turn VQA tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic domain router for query-specific reasoning
Hybrid text and image search agents
Multimodal multi-hop reasoning system
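The two-stage routing described above can be sketched in Python. This is a minimal, hypothetical illustration, not the paper's implementation: QA-Dragon's routers are MLLM-based, whereas here `domain_router` and `search_router` are stubbed with keyword heuristics, and all names (`DOMAIN_KEYWORDS`, `answer`, the agent labels) are invented for this sketch.

```python
# Hypothetical sketch of QA-Dragon's two-stage routing. The real system
# uses MLLM-based routers; keyword heuristics stand in for them here.

DOMAIN_KEYWORDS = {
    "shopping": ["price", "buy", "cost"],
    "sports": ["score", "team", "match"],
}

def domain_router(query: str) -> str:
    """Stage 1: identify the query's subject domain for domain-specific reasoning."""
    q = query.lower()
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if any(kw in q for kw in keywords):
            return domain
    return "general"

def search_router(query: str, has_image: bool) -> list[str]:
    """Stage 2: dynamically select retrieval agents (text, image, or both)."""
    agents = []
    if has_image:
        agents.append("image_search")  # ground the visual entity first
    if any(tok in query.lower() for tok in ("latest", "current", "price", "score")):
        agents.append("text_search")  # up-to-date facts need text retrieval
    return agents or ["text_search"]

def answer(query: str, has_image: bool) -> dict:
    """Orchestrate both routers; in the full system, evidence from each
    agent would be passed to the MLLM for multi-hop, multi-turn reasoning."""
    return {
        "domain": domain_router(query),
        "agents": search_router(query, has_image),
    }

print(answer("What is the current price of this sneaker?", has_image=True))
```

A hybrid query about a photographed product would route to the shopping domain and invoke both image and text agents, while a purely textual follow-up turn would fall back to text retrieval alone.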
Authors
Zhuohang Jiang (The Hong Kong Polytechnic University): LLM, RAG, RecSys
Pangjing Wu (The Hong Kong Polytechnic University): Reinforcement Learning, Natural Language Processing, Data Mining
Xu Yuan (The Hong Kong Polytechnic University, Hong Kong SAR, China)
Wenqi Fan (The Hong Kong Polytechnic University, Hong Kong SAR, China)
Qing Li (The Hong Kong Polytechnic University, Hong Kong SAR, China)