Scaling Medical Reasoning Verification via Tool-Integrated Reinforcement Learning

📅 2026-01-28
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses two limitations of existing medical reasoning verification methods: they rely on single-pass retrieval and produce only scalar rewards, and so lack both interpretability and the ability to acquire knowledge dynamically during verification. The authors propose a tool-augmented reinforcement learning agent framework whose verifier iteratively queries external medical corpora as it evaluates a reasoning trace, combining trajectory-supervised iterative reinforcement learning with an adaptive curriculum mechanism. Grounding verification in dynamically retrieved evidence improves both its reliability and its interpretability. Evaluated on four medical reasoning benchmarks, the method substantially outperforms existing approaches, improving accuracy by 23.5% on MedQA and 32.0% on MedXpertQA relative to the base generator, while cutting the required sampling budget by a factor of eight.
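The iterative, tool-augmented verification loop the summary describes can be sketched roughly as follows. This is a minimal illustration under our own assumptions, not the paper's implementation: the corpus, the keyword retrieval in `search_corpus`, the query-relaxation retry, and the substring support check in `verify_trace` are all hypothetical stand-ins.

```python
# Hypothetical sketch of iterative, tool-augmented verification.
# All names and heuristics here are illustrative, not from the paper.

MAX_TOOL_CALLS = 3  # retrieval budget per reasoning step

def search_corpus(corpus, query):
    """Toy retrieval: return facts sharing at least one keyword with the query."""
    return [fact for fact in corpus if any(w in fact for w in query.split())]

def verify_trace(trace_steps, corpus):
    """Check each reasoning step against iteratively retrieved evidence."""
    justification = []
    for step in trace_steps:
        query, hits = step, []
        for _ in range(MAX_TOOL_CALLS):
            hits = search_corpus(corpus, query)
            if hits:                              # evidence found: stop querying
                break
            query = " ".join(query.split()[:-1])  # relax the query and retry
        label = "supported" if any(step in fact for fact in hits) else "unsupported"
        justification.append((step, label))
    verdict = all(lbl == "supported" for _, lbl in justification)
    return verdict, justification                 # verdict plus per-step rationale

corpus = [
    "metformin is first-line therapy for type 2 diabetes",
    "insulin is added when oral agents fail",
]
verdict, rationale = verify_trace(["first-line therapy", "oral agents fail"], corpus)
```

Unlike a scalar reward model, the loop returns a per-step rationale alongside the verdict, which is the interpretability benefit the summary highlights.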

📝 Abstract
Large language models have achieved strong performance on medical reasoning benchmarks, yet their deployment in clinical settings demands rigorous verification to ensure factual accuracy. While reward models offer a scalable approach to reasoning trace verification, existing methods face two limitations: they produce only scalar reward values without explicit justification, and they rely on single-pass retrieval that precludes adaptive knowledge access as verification unfolds. We introduce $\method$, an agentic framework that addresses these limitations by training medical reasoning verifiers to iteratively query external medical corpora during evaluation. Our approach combines tool-augmented verification with an iterative reinforcement learning paradigm that requires only trace-level supervision, alongside an adaptive curriculum mechanism that dynamically adjusts the training data distribution. Across four medical reasoning benchmarks, $\method$ achieves substantial gains over existing methods, improving MedQA accuracy by 23.5% and MedXpertQA accuracy by 32.0% relative to the base generator. Crucially, $\method$ requires an $\mathbf{8\times}$ smaller sampling budget than prior reward model baselines. These findings establish that grounding verification in dynamically retrieved evidence offers a principled path toward more reliable medical reasoning systems.
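The sampling-budget claim presumably refers to verifier-guided best-of-N selection: a more reliable verifier lets the generator sample fewer candidate answers for the same final accuracy. A generic best-of-N sketch is below; the abstract does not spell out this protocol, and every name here (the toy `generate` policy, the binary `score` verifier, the 30% accuracy rate) is an assumption for illustration.

```python
import random

def best_of_n(generate, score, n, seed=0):
    """Sample n candidate answers and keep the one the verifier scores highest."""
    rng = random.Random(seed)                      # fixed seed for reproducibility
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=score)

# Toy setup: the 'generator' is right 30% of the time, and the
# 'verifier' scores correct answers higher (an idealized reward model).
def generate(rng):
    return "correct" if rng.random() < 0.3 else "wrong"

def score(answer):
    return 1.0 if answer == "correct" else 0.0

answer = best_of_n(generate, score, n=8)
```

Under this protocol, a verifier that ranks candidates more accurately reaches a given answer quality with a smaller `n`, which is the sense in which a stronger verifier can shrink the sampling budget.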
Problem

Research questions and friction points this paper is trying to address.

medical reasoning verification
reward models
factual accuracy
adaptive knowledge access
clinical deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

tool-integrated reinforcement learning
iterative verification
adaptive retrieval
medical reasoning
agent-based verification