🤖 AI Summary
This study addresses the lack of systematic evaluation methods for large language models (LLMs) in the clinical pre-consultation phase. The authors propose EPAG, the first pre-consultation assessment framework that integrates clinical practice guidelines, and construct a high-quality benchmark dataset. Model performance is evaluated both directly, by aligning patient chief complaints with medical guidelines, and indirectly, via diagnostic accuracy. Experimental results demonstrate that smaller models fine-tuned on domain-specific data can outperform state-of-the-art large models. Furthermore, the study finds no significant positive correlation between the length of patient complaints and diagnostic performance, suggesting a nonlinear relationship between input length and clinical efficacy. The authors release their dataset and evaluation pipeline to advance the reliable deployment of LLMs in clinical settings.
📝 Abstract
We introduce EPAG, a benchmark dataset and framework designed for Evaluating the Pre-consultation Ability of LLMs using diagnostic Guidelines. LLMs are evaluated directly through HPI-diagnostic guideline comparison and indirectly through disease diagnosis. In our experiments, we observe that small open-source models fine-tuned on a well-curated, task-specific dataset can outperform frontier LLMs in pre-consultation. Additionally, we find that an increased amount of HPI (History of Present Illness) does not necessarily lead to improved diagnostic performance. Further experiments reveal that the language of pre-consultation influences the characteristics of the dialogue. By open-sourcing our dataset and evaluation pipeline at https://github.com/seemdog/EPAG, we aim to contribute to the evaluation and further development of LLM applications in real-world clinical settings.