🤖 AI Summary
This study addresses a gap in current medical large language models: they struggle to emulate the real-world clinical diagnostic process, which proceeds from incomplete information through multi-turn interaction and dynamic hypothesis updating. To bridge this gap, the authors propose a tree-structured distillation framework in which a large model interacts with a simulated clinical environment to generate multi-turn diagnostic trajectories. They introduce two knowledge-graph-grounded metrics, Disease Trajectory Consistency (DTC) and Reasoning-Action Consistency (RAC), to filter for high-quality trajectories. The work presents the first systematic analysis of three failure modes of large models in active diagnosis (ungrounded test ordering, unreliable diagnostic updates, and degraded multi-turn coherence) and introduces MedAction-32K, the first large-scale multi-turn active diagnosis dataset. An 8B model fine-tuned on this dataset achieves state-of-the-art performance among open-source models on both MedR-Bench and the newly curated MedAction-300-Hard benchmark.
📝 Abstract
Most existing evaluations of LLM diagnosis use static, single-turn settings in which complete patient information is provided upfront, an oversimplification of real clinical practice. We study active diagnosis: the real-life clinical process of starting from an initial observation, ordering tests, interpreting results, and updating a differential diagnosis across multiple turns. Through systematic analysis, we identify three recurring failure modes in current LLMs: ungrounded test ordering, unreliable diagnostic updates, and degraded multi-turn coherence. Together, these failures reveal a core deficit: existing medical training data teaches models to reason from complete information but not to act under evolving, partial evidence. To address this gap, we introduce MedAction, a tree-structured distillation pipeline that synthesizes diverse, high-quality multi-turn diagnostic trajectories via LLM-environment interaction. We propose two knowledge-graph-grounded metrics for filtering trajectories by quality: Disease Trajectory Consistency (DTC), which tracks whether the model's hypothesis converges toward the correct diagnosis, and Reasoning-Action Consistency (RAC), which verifies that belief updates are driven by gathered evidence. Using this pipeline, we construct MedAction-32K, a dataset of 32,681 trajectories from 2,896 PMC cases. Fine-tuning an 8B model on MedAction-32K achieves state-of-the-art performance among open-source models on both MedR-Bench and our curated MedAction-300-Hard benchmark, pushing the frontier of open-source medical LLMs.
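The abstract describes DTC and RAC only in words and does not give their formulas. The sketch below is one possible reading, not the authors' implementation. It assumes DTC can be approximated as the fraction of turns in which the top hypothesis does not move farther from the true diagnosis in a knowledge graph, and RAC as the fraction of hypothesis changes whose new hypothesis is directly linked to the evidence gathered in that turn. The toy graph, the `dtc_score`, `rac_score`, and `keep_trajectory` names, and the thresholds are all hypothetical.

```python
from collections import deque

# Toy knowledge graph edges (hypothetical, for illustration only).
EDGES = [
    ("gerd", "chest pain"),
    ("chest pain", "pulmonary embolism"),
    ("chest pain", "myocardial infarction"),
    ("pulmonary embolism", "elevated d-dimer"),
    ("myocardial infarction", "elevated troponin"),
    ("myocarditis", "elevated troponin"),
]


def build_graph(edges):
    """Build an undirected adjacency map from an edge list."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def kg_distance(graph, start, goal):
    """Shortest-path hop count via breadth-first search; None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def dtc_score(graph, turns, truth):
    """Assumed DTC: share of consecutive turns where the hypothesis gets no farther from truth."""
    dists = [kg_distance(graph, t["hypothesis"], truth) for t in turns]
    if len(dists) < 2 or any(d is None for d in dists):
        return 0.0
    steps = list(zip(dists, dists[1:]))
    return sum(1 for prev, cur in steps if cur <= prev) / len(steps)


def rac_score(graph, turns):
    """Assumed RAC: share of hypothesis changes linked in the graph to that turn's evidence."""
    changes = supported = 0
    for prev, cur in zip(turns, turns[1:]):
        if cur["hypothesis"] != prev["hypothesis"]:
            changes += 1
            if cur["evidence"] in graph.get(cur["hypothesis"], ()):
                supported += 1
    return supported / changes if changes else 1.0


def keep_trajectory(graph, turns, truth, dtc_min=0.8, rac_min=0.8):
    """Keep a trajectory only if both scores clear their (hypothetical) thresholds."""
    return dtc_score(graph, turns, truth) >= dtc_min and rac_score(graph, turns) >= rac_min


GRAPH = build_graph(EDGES)
TRUTH = "myocardial infarction"

# A trajectory that converges on the true diagnosis, each update backed by evidence.
good = [
    {"hypothesis": "gerd", "evidence": "chest pain"},
    {"hypothesis": "pulmonary embolism", "evidence": "elevated d-dimer"},
    {"hypothesis": "myocardial infarction", "evidence": "elevated troponin"},
]

# A trajectory that drifts away from the truth despite supporting evidence.
bad = [
    {"hypothesis": "myocardial infarction", "evidence": "chest pain"},
    {"hypothesis": "gerd", "evidence": "elevated troponin"},
]

print(keep_trajectory(GRAPH, good, TRUTH), keep_trajectory(GRAPH, bad, TRUTH))
```

Under these assumed definitions, the converging trajectory passes both checks and the drifting one fails both. In a tree-structured pipeline, such a filter would be applied to each root-to-leaf path sampled from the LLM-environment interaction.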