🤖 AI Summary
Existing LLM-based clinical evaluation benchmarks rely on static question-answering, failing to capture the dynamic, iterative nature of real-world clinical reasoning—such as multi-turn patient history gathering, differential diagnosis refinement, and prioritized test ordering—while suffering from data contamination and coarse-grained assessment.
Method: We propose the first dynamic diagnostic dialogue evaluation framework: (1) automatically generating realistic patient cases via disease knowledge graphs; (2) simulating authentic interactions using hybrid rule- and generative-based patient agents; (3) employing a doctor agent to drive multi-turn diagnostic reasoning; and (4) introducing a fine-grained quality scoring system assessing hypothesis generation, test prioritization, and iterative differential diagnosis, alongside response efficiency metrics.
Results: Experiments show that the framework exposes systematic deficits of state-of-the-art LLMs in dynamic clinical reasoning, while improving the clinical fidelity, interpretability, and transparency of the evaluation itself.
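The multi-turn loop in the method above (a doctor agent querying a patient agent until it commits to a diagnosis) can be sketched as follows. This is a minimal illustration, not the paper's implementation: all class names are hypothetical, the doctor is a scripted stand-in for an LLM, and the patient agent is purely rule-based (ClinDEF's patient agents combine rules with generation).

```python
from dataclasses import dataclass

@dataclass
class PatientCase:
    """A generated case: ground-truth disease plus symptom responses."""
    disease: str
    symptoms: dict  # symptom keyword -> patient's reply

class RuleBasedPatientAgent:
    """Answers doctor questions by keyword lookup against the case."""
    def __init__(self, case: PatientCase):
        self.case = case

    def answer(self, question: str) -> str:
        for symptom, reply in self.case.symptoms.items():
            if symptom in question.lower():
                return reply
        return "I don't know."

class ScriptedDoctorAgent:
    """Stand-in for the LLM doctor: asks fixed questions, then diagnoses."""
    def __init__(self, questions, diagnose_rule):
        self.questions = list(questions)
        self.diagnose_rule = diagnose_rule  # gathered evidence -> diagnosis
        self.evidence = {}

    def next_question(self):
        return self.questions.pop(0) if self.questions else None

    def observe(self, question, reply):
        self.evidence[question] = reply

    def diagnose(self):
        return self.diagnose_rule(self.evidence)

def run_dialogue(doctor, patient, max_turns=10):
    """Drive the multi-turn interaction; return (diagnosis, turns used)."""
    turns = 0
    while turns < max_turns:
        question = doctor.next_question()
        if question is None:
            break
        doctor.observe(question, patient.answer(question))
        turns += 1
    return doctor.diagnose(), turns

# Toy case with a flu-like presentation.
case = PatientCase("influenza", {
    "fever": "Yes, since yesterday.",
    "cough": "Yes, a dry cough.",
})
patient = RuleBasedPatientAgent(case)
doctor = ScriptedDoctorAgent(
    ["Do you have a fever?", "Do you have a cough?"],
    lambda ev: "influenza" if all("Yes" in r for r in ev.values()) else "unknown",
)
diagnosis, turns = run_dialogue(doctor, patient)
print(diagnosis, turns)  # influenza 2
```

A real evaluation harness would additionally score the transcript (hypothesis generation, test prioritization, differential refinement) and log turn counts as the efficiency metric; both `diagnosis` and `turns` here correspond to the accuracy and efficiency signals the summary mentions.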
📝 Abstract
Clinical diagnosis begins with doctor-patient interaction, during which physicians iteratively gather information, order examinations, and refine differential diagnoses based on patients' responses. This dynamic clinical-reasoning process is poorly represented by existing LLM benchmarks that focus on static question-answering. To address these gaps, recent methods explore dynamic medical frameworks involving interactive clinical dialogues. Although effective, they often rely on limited, contamination-prone datasets and lack granular, multi-level evaluation. In this work, we propose ClinDEF, a dynamic framework for assessing clinical reasoning in LLMs through simulated diagnostic dialogues. Grounded in a disease knowledge graph, our method dynamically generates patient cases and facilitates multi-turn interactions between an LLM-based doctor and an automated patient agent. Our evaluation protocol goes beyond diagnostic accuracy by incorporating fine-grained efficiency analysis and rubric-based assessment of diagnostic quality. Experiments show that ClinDEF effectively exposes critical clinical reasoning gaps in state-of-the-art LLMs, offering a more nuanced and clinically meaningful evaluation paradigm.