🤖 AI Summary
This paper addresses the lack of standardized, dynamic, interactive evaluation frameworks for large language models (LLMs) in intelligent outpatient referral (IOR) tasks. To this end, we propose the first dual-modal evaluation framework integrating static recommendation and dynamic dialogue optimization. Methodologically, we construct a structured benchmark grounded in multi-scale prompt engineering, dialogue trajectory modeling, and human calibration, covering major open- and closed-source LLMs (e.g., Llama, GPT series) alongside BERT-based baselines. Experimental results show that LLMs significantly outperform fine-tuned BERT in dynamic follow-up question generation quality, yet yield only marginal gains in static referral accuracy. Our key contributions are: (1) formalizing the core evaluation paradigm for IOR; (2) releasing the first structured benchmark and evaluation protocol specifically designed for outpatient referral; and (3) empirically characterizing the capability boundaries and applicable scenarios of LLMs in interactive clinical consultation.
📝 Abstract
Large language models (LLMs) are increasingly applied to outpatient referral tasks across healthcare systems. However, there is a lack of standardized evaluation criteria for assessing their effectiveness, particularly in dynamic, interactive scenarios. In this study, we systematically examine the capabilities and limitations of LLMs in managing tasks within Intelligent Outpatient Referral (IOR) systems and propose a comprehensive evaluation framework specifically designed for such systems. The framework comprises two core tasks: static evaluation, which assesses the ability to recommend predefined outpatient referrals, and dynamic evaluation, which assesses the ability to refine referral recommendations through iterative dialogues. Our findings suggest that LLMs offer limited advantages over BERT-like models in static referral accuracy, but show promise in asking effective follow-up questions during interactive dialogues.
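To make the two evaluation tasks concrete, the sketch below shows how a dual-modal protocol of this kind could be scored: one-shot department recommendation for the static task, and a short simulated-patient loop for the dynamic task. The case fields, the `recommend` / `ask_followup` callables, and the turn limit are illustrative assumptions, not the benchmark's released protocol.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReferralCase:
    """One benchmark case: a chief complaint and its gold department label."""
    complaint: str
    gold_department: str
    hidden_details: List[str]  # facts revealed only if the model asks for them


def static_accuracy(cases: List[ReferralCase],
                    recommend: Callable[[str], str]) -> float:
    """Static task: one-shot department recommendation from the complaint alone."""
    hits = sum(recommend(c.complaint) == c.gold_department for c in cases)
    return hits / len(cases)


def dynamic_accuracy(cases: List[ReferralCase],
                     recommend: Callable[[str], str],
                     ask_followup: Callable[[str], int],
                     max_turns: int = 3) -> float:
    """Dynamic task: the model may ask follow-up questions before referring.

    ask_followup returns the index of the hidden detail it wants next,
    or -1 once it is ready to commit to a referral.
    """
    hits = 0
    for c in cases:
        context = c.complaint
        for _ in range(max_turns):
            choice = ask_followup(context)
            if choice < 0 or choice >= len(c.hidden_details):
                break
            context += " " + c.hidden_details[choice]  # simulated patient reply
        hits += recommend(context) == c.gold_department
    return hits / len(cases)
```

Comparing `static_accuracy` against `dynamic_accuracy` for the same model isolates how much the iterative questioning, rather than the base recommendation ability, contributes to referral quality, which is the distinction the framework is designed to expose.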