Process-Supervised Multi-Agent Reinforcement Learning for Reliable Clinical Reasoning

📅 2026-02-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation of existing large language model–based multi-agent systems in clinical decision-making, which prioritize outcome accuracy while neglecting reasoning processes aligned with clinical guidelines. To bridge this gap, the authors propose the first multi-agent reinforcement learning framework that integrates process supervision with outcome-based rewards. The approach employs a hierarchical collaboration mechanism to orchestrate the reasoning workflow and leverages the GRPO algorithm to train Qwen3-4B as a supervisory agent, aligning the reasoning trajectory with expert standards. Evaluated on the ClinGen gene–disease validity curation task, the method achieves an accuracy of 0.750 while significantly improving the reasoning-process F1 score to 0.520, outperforming baselines that rely solely on outcome rewards and effectively balancing both accuracy and procedural compliance.
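The summary notes that the supervisory agent is trained with the GRPO algorithm. As a rough illustration of GRPO's core idea (not the authors' implementation), each sampled rollout's reward is standardized against the mean and standard deviation of its own sampled group, yielding group-relative advantages without a learned value function:

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: standardize each rollout's reward
    against the mean/std of its sampled group (the core GRPO idea)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mu) / sigma for r in rewards]
```

For example, a group of rewards `[1, 0, 1, 0]` yields advantages `[1, -1, 1, -1]`, so successful rollouts are reinforced relative to their group rather than against a global baseline.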

📝 Abstract
Clinical decision-making requires nuanced reasoning over heterogeneous evidence and traceable justifications. While recent LLM multi-agent systems (MAS) show promise, they largely optimise for outcome accuracy while overlooking process-grounded reasoning aligned with clinical standards. One critical real-world case of this is gene-disease validity curation, where experts must determine whether a gene is causally implicated in a disease by synthesising diverse biomedical evidence. We introduce an agent-as-tool reinforcement learning framework for this task with two objectives: (i) process-level supervision to ensure reasoning follows valid clinical pathways, and (ii) efficient coordination via a hierarchical multi-agent system. Our evaluation on the ClinGen dataset shows that with outcome-only rewards, MAS with a GRPO-trained Qwen3-4B supervisor agent substantially improves final outcome accuracy from 0.195 with a base model supervisor to 0.732, but results in poor process alignment (0.392 F1). Conversely, with process + outcome rewards, MAS with GRPO-trained supervisor achieves higher outcome accuracy (0.750) while significantly improving process fidelity to 0.520 F1. Our code is available at https://github.com/chaeeunlee-io/GeneDiseaseCurationAgents.
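The abstract contrasts outcome-only rewards with a combined process + outcome reward. A minimal sketch of such a combination, assuming the process reward is an F1 overlap between the agent's reasoning steps and expert-annotated steps, and that the two terms are mixed with a weight `alpha` (the weighting and scoring functions are assumptions, not the authors' implementation):

```python
def outcome_reward(pred_label: str, gold_label: str) -> float:
    """1.0 if the final gene-disease validity classification matches, else 0.0."""
    return 1.0 if pred_label == gold_label else 0.0

def process_reward(pred_steps: set[str], gold_steps: set[str]) -> float:
    """F1 overlap between predicted reasoning steps and expert-annotated steps."""
    if not pred_steps or not gold_steps:
        return 0.0
    tp = len(pred_steps & gold_steps)
    precision = tp / len(pred_steps)
    recall = tp / len(gold_steps)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def combined_reward(pred_label: str, gold_label: str,
                    pred_steps: set[str], gold_steps: set[str],
                    alpha: float = 0.5) -> float:
    """Weighted mix of outcome and process rewards (alpha is an assumption)."""
    return (alpha * outcome_reward(pred_label, gold_label)
            + (1 - alpha) * process_reward(pred_steps, gold_steps))
```

Under this sketch, a correct final label with only partial step overlap receives partial credit, which is the mechanism the abstract credits for balancing outcome accuracy (0.750) with process fidelity (0.520 F1).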
Problem

Research questions and friction points this paper is trying to address.

clinical reasoning
multi-agent reinforcement learning
process supervision
gene-disease validity
clinical decision-making
Innovation

Methods, ideas, or system contributions that make the work stand out.

process-supervised reinforcement learning
multi-agent system
clinical reasoning
hierarchical coordination
GRPO
Chaeeun Lee
School of Informatics, University of Edinburgh, UK
T. Michael Yates
School of Informatics, University of Edinburgh, UK
Pasquale Minervini
University of Edinburgh, Miniml.AI, ELLIS Scholar
Generative AI · Machine Learning · Natural Language Processing · Machine Reasoning
T. Ian Simpson
School of Informatics, University of Edinburgh, UK