🤖 AI Summary
Existing evaluations of large language models (LLMs) in epilepsy diagnosis over-rely on structured question-answering, failing to reflect real-world clinical reasoning from unstructured seizure narratives. Method: We introduce SemioLLM, a novel evaluation framework grounded in 1,269 authentic, free-text epilepsy seizure reports, systematically benchmarking six state-of-the-art LLMs on localizing seizure onset zones. Contribution/Results: This work establishes the first diagnostic reasoning evaluation paradigm for unstructured clinical narratives; reveals significant impacts of clinical role prompting, input text length, and context window size on model performance; and uncovers a critical trade-off—high accuracy often co-occurs with hallucination and poor evidence grounding. Through expert-informed chain-of-thought prompting, clinical scenario simulation, and probabilistic output calibration, GPT-4 and Qwen-72B achieve near-expert clinician-level accuracy. Our framework provides a scalable, clinically grounded benchmark for assessing interpretability and real-world deployability of medical AI.
📝 Abstract
Large Language Models (LLMs) have been shown to encode clinical knowledge. Many evaluations, however, rely on structured question-answer benchmarks, overlooking critical challenges of interpreting and reasoning about unstructured clinical narratives in real-world settings. Using free-text clinical descriptions, we present SemioLLM, an evaluation framework that benchmarks 6 state-of-the-art models (GPT-3.5, GPT-4, Mixtral-8x7B, Qwen-72B, LlaMa2, LlaMa3) on a core diagnostic task in epilepsy. Leveraging a database of 1,269 seizure descriptions, we show that most LLMs are able to accurately and confidently generate probabilistic predictions of seizure onset zones in the brain. Most models approach clinician-level performance after prompt engineering, with expert-guided chain-of-thought reasoning leading to the most consistent improvements. Performance was further strongly modulated by clinical in-context impersonation, narrative length and language context (13.7%, 32.7% and 14.2% performance variation, respectively). However, expert analysis of reasoning outputs revealed that correct prediction can be based on hallucinated knowledge and deficient source citation accuracy, underscoring the need to improve interpretability of LLMs in clinical use. Overall, SemioLLM provides a scalable, domain-adaptable framework for evaluating LLMs in clinical disciplines where unstructured verbal descriptions encode diagnostic information. By identifying both the strengths and limitations of state-of-the-art models, our work supports the development of clinically robust and globally applicable AI systems for healthcare.