🤖 AI Summary
Existing LLM-based diagnostic models are trained on static case data and lack dynamic, multi-turn diagnostic capabilities. To address this, we propose DiagGym, the first virtual clinical reinforcement learning environment built on real-world electronic health records. Methodologically, we pair the DiagGym world model, a conditional generative model of examination outcomes, with the DiagAgent diagnostic agent, trained via end-to-end multi-turn reinforcement learning to jointly optimize test-ordering recommendations and final diagnostic decisions in an interactive setting. We further introduce DiagBench, a physician-annotated evaluation benchmark. Experiments show that our approach significantly outperforms ten state-of-the-art large language models and two prompt-engineered agents: in the end-to-end setting it improves diagnostic accuracy by 15.12% and examination-recommendation F1 score by 23.09%, and in rubric-based evaluation it surpasses the next-best model, Claude-sonnet-4, by 7.1% in weighted rubric score. This work establishes the first evaluable and optimizable framework for learning clinical diagnostic strategies through interactive reinforcement learning.
📝 Abstract
In this paper, we present a framework for training large language models (LLMs) as diagnostic agents with reinforcement learning, enabling them to manage multi-turn diagnostic processes, adaptively select examinations, and commit to final diagnoses. Unlike instruction-tuned models trained on static case summaries, our method acquires diagnostic strategies through interactive exploration and outcome-based feedback. Our contributions are fourfold: (i) We present DiagGym, a diagnostics world model trained on electronic health records that emits examination outcomes conditioned on patient history and the recommended examination, serving as a virtual clinical environment for realistic diagnosis training and evaluation; (ii) We train DiagAgent via end-to-end, multi-turn reinforcement learning to learn diagnostic policies that optimize both information yield and diagnostic accuracy; (iii) We introduce DiagBench, a diagnostic benchmark comprising 750 cases with physician-validated examination recommendations and 99 cases annotated with 973 physician-written rubrics on the diagnostic process; (iv) We demonstrate superior performance across diverse diagnostic settings. DiagAgent significantly outperforms 10 state-of-the-art LLMs, including DeepSeek-v3 and GPT-4o, as well as two prompt-engineered agents. In single-turn settings, DiagAgent achieves 9.34% higher diagnostic accuracy and a 44.03% improvement in examination recommendation hit ratio. In end-to-end settings, it delivers a 15.12% increase in diagnostic accuracy and a 23.09% boost in examination recommendation F1 score. In rubric-based evaluation, it surpasses the next-best model, Claude-sonnet-4, by 7.1% in weighted rubric score. These findings indicate that learning policies in interactive clinical environments confers dynamic and clinically meaningful diagnostic management abilities unattainable through passive training alone.
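To make the interaction protocol concrete, the loop described above can be sketched as a Gym-style environment: the agent either orders an examination, for which the world model generates a result conditioned on the patient record, or commits to a final diagnosis, which ends the episode with an outcome-based reward. This is a minimal, hypothetical sketch, not DiagGym's actual interface; the class name `DiagEnv`, the action tuples, and the lookup-table stub standing in for the conditional generative world model are all illustrative assumptions.

```python
class DiagEnv:
    """Minimal sketch of a DiagGym-style environment (hypothetical interface).

    A real world model would generate examination results conditioned on the
    full patient history; here a per-case lookup table stands in for it.
    """

    def __init__(self, case):
        # case: {"profile": str, "exams": {name: result}, "diagnosis": str}
        self.case = case
        self.history = []  # (exam, result) pairs revealed so far

    def reset(self):
        self.history = []
        return self.case["profile"]  # initial observation: patient presentation

    def step(self, action):
        # Action is ("exam", exam_name) or ("diagnose", label).
        kind, value = action
        if kind == "diagnose":
            # Outcome-based reward: 1 for the correct final diagnosis, else 0.
            reward = 1.0 if value == self.case["diagnosis"] else 0.0
            return None, reward, True  # episode terminates
        # Stub for the conditional generative world model.
        result = self.case["exams"].get(value, "unremarkable")
        self.history.append((value, result))
        return result, 0.0, False  # no intermediate reward in this sketch


# Usage: a toy rollout that orders one exam, then commits to a diagnosis.
case = {
    "profile": "58M, chest pain on exertion",
    "exams": {"troponin": "elevated"},
    "diagnosis": "acute coronary syndrome",
}
env = DiagEnv(case)
obs = env.reset()
obs, reward, done = env.step(("exam", "troponin"))
obs, reward, done = env.step(("diagnose", "acute coronary syndrome"))
```

A multi-turn RL trainer would run many such rollouts and update the agent's policy from the terminal reward, which is what distinguishes this setup from supervised fine-tuning on static case summaries.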