From Conflict to Consensus: Boosting Medical Reasoning via Multi-Round Agentic RAG

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the susceptibility of large language models to hallucinations and outdated knowledge in medical question answering, as well as the limited multi-hop reasoning capability and noise robustness of existing retrieval-augmented generation (RAG) approaches. To overcome these limitations, the authors propose MA-RAG, a framework that treats semantic conflict among candidate answers as an active signal for iteratively refining retrieval queries and reasoning trajectories through a multi-agent process, enabling the co-evolution of retrieval and reasoning. By integrating self-consistency principles with test-time scaling, MA-RAG achieves an average accuracy improvement of 6.8 points over the backbone model across seven medical QA benchmarks, significantly outperforming current RAG and test-time scaling methods.
📝 Abstract
Large Language Models (LLMs) exhibit strong reasoning capabilities in medical question answering, but their tendency to produce hallucinations and rely on outdated knowledge poses critical risks in healthcare. While Retrieval-Augmented Generation (RAG) mitigates these issues, existing methods rely on noisy token-level signals and lack the multi-round refinement required for complex reasoning. In this paper, we propose MA-RAG (Multi-Round Agentic RAG), a framework that facilitates test-time scaling for complex medical reasoning by iteratively evolving both external evidence and internal reasoning history within an agentic refinement loop. At each round, the agent transforms semantic conflict among candidate responses into actionable queries for retrieving external evidence, while optimizing historical reasoning traces to mitigate long-context degradation. MA-RAG extends the self-consistency principle by leveraging the lack of consistency as a proactive signal for multi-round agentic reasoning and retrieval, and mirrors a boosting mechanism that iteratively minimizes the residual error toward a stable, high-fidelity medical consensus. Extensive evaluations across 7 medical Q&A benchmarks show that MA-RAG consistently surpasses competitive inference-time scaling and RAG baselines, delivering a substantial +6.8-point gain in average accuracy over the backbone model. Our code is available at https://github.com/NJU-RL/MA-RAG.
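The multi-round loop the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the sampler, retriever, consensus threshold, and the way conflicts are turned into queries are all hypothetical stand-ins for the paper's components.

```python
from collections import Counter

def consensus(candidates, threshold=0.8):
    """Self-consistency check: return the majority answer and whether
    its vote share reaches the agreement threshold."""
    counts = Counter(candidates)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(candidates) >= threshold

def conflict_query(question, candidates):
    """Turn disagreement among candidates into a retrieval query.
    Hypothetical: the paper derives queries from semantic conflicts;
    here we simply append the disputed answers to the question."""
    disputed = sorted(set(candidates))
    return f"{question} | disputed: {', '.join(disputed)}"

def ma_rag(question, llm_sample, retrieve, rounds=3, n=5, threshold=0.8):
    """Multi-round agentic loop: sample candidate answers, stop early on
    consensus, otherwise retrieve conflict-driven evidence and retry.
    `llm_sample(question, evidence)` and `retrieve(query)` are caller-
    supplied stand-ins for the LLM and the retriever."""
    evidence = []
    for _ in range(rounds):
        candidates = [llm_sample(question, evidence) for _ in range(n)]
        answer, agreed = consensus(candidates, threshold)
        if agreed:
            return answer  # stable consensus reached, stop scaling
        evidence.append(retrieve(conflict_query(question, candidates)))
    return answer  # fall back to the final round's majority vote

# Toy run: the first round of 5 samples is conflicted (A,B,A,B,A),
# then after one retrieval the samples agree on A.
calls = iter("ABABA" + "AAAAA")
result = ma_rag("Q?", lambda q, ev: next(calls), lambda q: "doc")
# result == "A"
```

Each round plays the role of one boosting step: the residual disagreement among candidates is the error signal that drives the next retrieval.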
Problem

Research questions and friction points this paper is trying to address.

medical reasoning
hallucination
outdated knowledge
Retrieval-Augmented Generation
multi-round refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Round Agentic RAG
Medical Reasoning
Retrieval-Augmented Generation
Self-Consistency
Test-Time Scaling