🤖 AI Summary
This work addresses the challenge of inefficient dynamic history-taking and diagnostic reasoning in multi-turn clinical consultations with large language models. The authors propose a medical-note-driven framework that transforms real-world clinical records into structured doctor–patient dialogues and trains models with a three-stage fine-tuning strategy: supervised fine-tuning, synthetic data augmentation, and preference learning. A key innovation is the reformulation of history-taking as a sequence of single-turn reasoning problems, which enhances interpretability, enables localized supervision, and improves sample efficiency. In addition, high-quality dialogue data are generated under the guidance of decision trees, reducing reliance on scarce real conversational datasets. Experiments show that the proposed method substantially outperforms GPT-4o on clinical reasoning tasks, with a 16.9-point gain in F1 score and a 21.0-point improvement in top-1 diagnostic accuracy.
📝 Abstract
Effective clinical history taking is a foundational yet underexplored component of clinical reasoning. While large language models (LLMs) have shown promise on static benchmarks, they often fall short in dynamic, multi-turn diagnostic settings that require iterative questioning and hypothesis refinement. To address this gap, we propose \method{}, a note-driven framework that trains LLMs to conduct structured history taking and diagnosis by learning from widely available medical notes. Instead of relying on scarce and sensitive dialogue data, we convert real-world medical notes into high-quality doctor–patient dialogues using a decision-tree-guided generation and refinement pipeline. We then introduce a three-stage fine-tuning strategy that combines supervised learning, simulated data augmentation, and preference learning. Finally, we propose a novel single-turn reasoning paradigm that reframes history taking as a sequence of single-turn reasoning problems. This design enhances interpretability and enables local supervision, dynamic adaptation, and greater sample efficiency. Experimental results show that our method substantially improves clinical reasoning, achieving gains of +16.9 F1 and +21.0 top-1 diagnostic accuracy over GPT-4o. Our code and dataset can be found at https://github.com/zhentingsheng/Note2Chat.
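To make the single-turn reformulation concrete, here is a minimal, hypothetical sketch of how a multi-turn history-taking dialogue can be decomposed into independent single-turn training examples, each pairing the conversation so far with the doctor's next action (a question or a final diagnosis). The function name, dialogue schema, and toy utterances are illustrative assumptions, not the paper's actual data format.

```python
# Hypothetical sketch of the single-turn reasoning reformulation: each
# doctor turn becomes its own (context, target) example, enabling
# localized supervision on every questioning/diagnosis step.

def to_single_turn_examples(dialogue):
    """dialogue: list of (speaker, utterance) pairs.

    Returns one (context, target) training example per doctor turn,
    where context is everything said before that turn.
    """
    examples = []
    for i, (speaker, utterance) in enumerate(dialogue):
        if speaker == "doctor":
            context = dialogue[:i]  # conversation history up to this turn
            examples.append((context, utterance))
    return examples

# Toy dialogue illustrating the format (content is invented for illustration).
dialogue = [
    ("doctor", "What brings you in today?"),
    ("patient", "I've had chest pain for two days."),
    ("doctor", "Does the pain worsen with exertion?"),
    ("patient", "Yes, especially when climbing stairs."),
    ("doctor", "Diagnosis: suspected stable angina; recommend ECG."),
]

examples = to_single_turn_examples(dialogue)
print(len(examples))  # one example per doctor turn
```

Because every example is self-contained, each step can be supervised or preference-tuned in isolation, which is what makes the paradigm sample-efficient compared with training on whole trajectories.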