🤖 AI Summary
To address key challenges in radiology visual question answering (RVQA), including low factual accuracy, frequent hallucinations, and insufficient cross-modal alignment, this paper proposes a multi-agent collaborative framework. It introduces three functionally specialized agents (a Context Interpreter, a Multimodal Reasoner, and an Answer Verifier) that support interpretable, complex reasoning through explicit role division and coordinated interaction. For a more rigorous evaluation, the authors use model disagreement filtering to construct a high-difficulty benchmark of cases that are consistently hard across multiple multimodal large language models (MLLMs). The framework also integrates MLLMs with retrieval-augmented generation (RAG) to strengthen clinical knowledge grounding and enforce factual constraints. Experiments show that the approach outperforms strong MLLM baselines on this challenging RVQA set, and a case study illustrates its reliability and interpretability.
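The division of roles among the three agents can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the paper's implementation: the `Case` type, the function signatures, the `run_pipeline` name, and the retry-on-rejection loop with `max_rounds` are all hypothetical, and in practice each callable would wrap an MLLM call (with the interpreter possibly drawing on retrieved clinical text).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    """One RVQA item. Field names are illustrative, not taken from the paper."""
    image_id: str
    question: str
    options: List[str]


# Hypothetical agent signatures; each would wrap an MLLM call in a real system.
Interpreter = Callable[[Case], str]            # Context Interpreter
Reasoner = Callable[[Case, str], str]          # Multimodal Reasoner
Verifier = Callable[[Case, str, str], bool]    # Answer Verifier


def run_pipeline(case: Case, interpret: Interpreter, reason: Reasoner,
                 verify: Verifier, max_rounds: int = 2) -> str:
    """Interpret, reason, then verify; on rejection, retry with feedback.

    The retry loop is an assumption made for illustration only.
    """
    context = interpret(case)
    candidate = reason(case, context)
    for _ in range(max_rounds):
        if verify(case, context, candidate):
            break
        # Feed the rejected answer back so the reasoner can revise it.
        context += f"\nRejected: {candidate}"
        candidate = reason(case, context)
    return candidate
```

With stub agents in place of real models, the control flow can be exercised without any model access: a reasoner that changes its answer once it sees a rejection note, paired with a verifier that accepts only one option, makes the pipeline return the revised answer.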
📝 Abstract
Radiology visual question answering (RVQA) provides precise answers to questions about chest X-ray images, alleviating radiologists' workload. While recent methods based on multimodal large language models (MLLMs) and retrieval-augmented generation (RAG) have shown promising progress in RVQA, they still face challenges in factual accuracy, hallucinations, and cross-modal misalignment. We introduce a multi-agent system (MAS) designed to support complex reasoning in RVQA, with specialized agents for context understanding, multimodal reasoning, and answer validation. We evaluate our system on a challenging RVQA set curated via model disagreement filtering, comprising cases that are consistently hard across multiple MLLMs. Extensive experiments show that our system outperforms strong MLLM baselines, and a case study illustrates its reliability and interpretability. This work highlights the potential of multi-agent approaches to support explainable and trustworthy clinical AI applications that require complex reasoning.
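The abstract does not spell out the filtering criterion, so the following Python sketch shows only one plausible reading of model disagreement filtering: a case is kept as hard when the models fail to agree unanimously, or when their most common answer is wrong. The function name `filter_hard_cases`, the `min_agreement` threshold, and the input layout are assumptions for illustration, and each case is assumed to have at least one prediction.

```python
from collections import Counter
from typing import Dict, List


def filter_hard_cases(preds_by_case: Dict[str, List[str]],
                      gold_by_case: Dict[str, str],
                      min_agreement: float = 1.0) -> List[str]:
    """Return IDs of cases where models disagree or the majority is wrong.

    Hypothetical criterion: the paper's exact rule is not given in the abstract.
    """
    hard = []
    for case_id, preds in preds_by_case.items():
        # Normalize answers so "Yes" and "yes " count as the same prediction.
        answers = [p.strip().lower() for p in preds]
        top_answer, top_count = Counter(answers).most_common(1)[0]
        agreement = top_count / len(answers)
        majority_wrong = top_answer != gold_by_case[case_id].strip().lower()
        if agreement < min_agreement or majority_wrong:
            hard.append(case_id)
    return hard
```

Under this reading, a question that every model answers correctly is dropped, while a question with split answers, or one that every model answers incorrectly, is retained for the benchmark.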