🤖 AI Summary
This study investigates the causal relationship between physician questioning quality and diagnostic accuracy in online medical consultations. To this end, we propose a high-fidelity patient simulator integrating (i) strategy extraction from real physician–patient dialogues, (ii) dynamic response modeling driven by longitudinal electronic health records, and (iii) a multi-turn question–diagnosis co-evaluation framework. Our key contributions include: (i) the first empirical identification of Liebig’s Law in clinical reasoning—i.e., overall diagnostic accuracy is constrained by the weakest questioning step; (ii) a four-category taxonomy of questioning behaviors, revealing systematic biases in large language models (LLMs) during elicitation of accompanying symptoms and medical history; and (iii) quantitative evidence that questioning quality constitutes a critical bottleneck in diagnostic performance, with state-of-the-art LLMs significantly underperforming human physicians in this phase. We publicly release the fully reproducible simulator, establishing a new paradigm for evaluating and optimizing AI-powered diagnostic assistants.
📝 Abstract
Online medical consultation (OMC) restricts doctors to gathering patient information solely through inquiry, making the already complex sequential decision-making process of diagnosis even more challenging. Recently, the rapid advancement of large language models has demonstrated significant potential to transform OMC. However, most studies have focused primarily on improving diagnostic accuracy under conditions of relatively sufficient information, while paying limited attention to the "inquiry" phase of the consultation process. This lack of focus has left the relationship between "inquiry" and "diagnosis" insufficiently explored. In this paper, we first extract real patient interaction strategies from authentic doctor–patient conversations and use these strategies to guide the training of a patient simulator that closely mirrors real-world behavior. By feeding medical records into our patient simulator to generate patient responses, we conduct extensive experiments to explore the relationship between "inquiry" and "diagnosis" in the consultation process. Experimental results demonstrate that inquiry and diagnosis adhere to Liebig's law: poor inquiry quality limits the effectiveness of diagnosis regardless of diagnostic capability, and vice versa. Furthermore, the experiments reveal significant differences in the inquiry performance of various models. To investigate this phenomenon, we categorize the inquiry process into four types: (1) chief complaint inquiry; (2) specification of known symptoms; (3) inquiry about accompanying symptoms; and (4) gathering family or medical history. We analyze the distribution of inquiries across these four types for different models to explain their significant performance differences. We plan to open-source the weights and related code of our patient simulator at https://github.com/LIO-H-ZEN/PatientSimulator.
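The evaluation procedure the abstract describes — feeding a medical record to a patient simulator, letting a doctor model ask questions, and tallying inquiries by the four-type taxonomy — can be sketched as a minimal loop. This is a hypothetical illustration only: the names (`PatientSimulator`, `classify_inquiry`, `INQUIRY_TYPES`) and the keyword-based classifier are placeholders, not the released code's API.

```python
# Toy sketch of the inquiry–diagnosis evaluation loop (illustrative, not the real API).
from dataclasses import dataclass

# The paper's four inquiry types.
INQUIRY_TYPES = [
    "chief_complaint",
    "known_symptom_specification",
    "accompanying_symptoms",
    "family_or_medical_history",
]

@dataclass
class PatientSimulator:
    medical_record: dict  # ground-truth record that drives simulated answers

    def respond(self, question: str) -> str:
        # A real simulator generates record-grounded answers with an LLM;
        # here we just return the matching record field, if any.
        for key, value in self.medical_record.items():
            if key in question.lower():
                return str(value)
        return "I'm not sure."

def classify_inquiry(question: str) -> str:
    # Toy keyword classifier standing in for the four-type taxonomy.
    q = question.lower()
    if "main problem" in q or "brings you" in q:
        return "chief_complaint"
    if "history" in q or "family" in q:
        return "family_or_medical_history"
    if "also" in q or "other symptoms" in q:
        return "accompanying_symptoms"
    return "known_symptom_specification"

def run_consultation(simulator: PatientSimulator, questions: list[str]):
    """Collect (question, answer) pairs and the per-type inquiry distribution."""
    counts = {t: 0 for t in INQUIRY_TYPES}
    transcript = []
    for q in questions:
        counts[classify_inquiry(q)] += 1
        transcript.append((q, simulator.respond(q)))
    return transcript, counts
```

Comparing the `counts` distribution across doctor models is what exposes the systematic biases mentioned above, e.g. a model that rarely asks `accompanying_symptoms` or `family_or_medical_history` questions.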