SemioLLM: Evaluating Large Language Models for Diagnostic Reasoning from Unstructured Clinical Narratives in Epilepsy

📅 2024-07-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing evaluations of large language models (LLMs) in epilepsy diagnosis rely heavily on structured question-answering and do not reflect real-world clinical reasoning from unstructured seizure narratives. Method: We introduce SemioLLM, an evaluation framework built on 1,269 free-text seizure descriptions, and benchmark six state-of-the-art LLMs on localizing seizure onset zones. Contribution/Results: This work evaluates diagnostic reasoning directly on unstructured clinical narratives; shows that clinical role impersonation, narrative length, and language context substantially modulate model performance (13.7%, 32.7%, and 14.2% variation, respectively); and finds that correct predictions can rest on hallucinated knowledge and inaccurate source citation. After prompt engineering, most models approach clinician-level performance, with expert-guided chain-of-thought prompting giving the most consistent improvements. The framework provides a scalable, domain-adaptable benchmark for assessing the interpretability and clinical readiness of LLMs.

📝 Abstract

Large Language Models (LLMs) have been shown to encode clinical knowledge. Many evaluations, however, rely on structured question-answer benchmarks, overlooking critical challenges of interpreting and reasoning about unstructured clinical narratives in real-world settings. Using free-text clinical descriptions, we present SemioLLM, an evaluation framework that benchmarks 6 state-of-the-art models (GPT-3.5, GPT-4, Mixtral-8x7B, Qwen-72B, LlaMa2, LlaMa3) on a core diagnostic task in epilepsy. Leveraging a database of 1,269 seizure descriptions, we show that most LLMs are able to accurately and confidently generate probabilistic predictions of seizure onset zones in the brain. Most models approach clinician-level performance after prompt engineering, with expert-guided chain-of-thought reasoning leading to the most consistent improvements. Performance was further strongly modulated by clinical in-context impersonation, narrative length and language context (13.7%, 32.7% and 14.2% performance variation, respectively). However, expert analysis of reasoning outputs revealed that correct prediction can be based on hallucinated knowledge and deficient source citation accuracy, underscoring the need to improve interpretability of LLMs in clinical use. Overall, SemioLLM provides a scalable, domain-adaptable framework for evaluating LLMs in clinical disciplines where unstructured verbal descriptions encode diagnostic information. By identifying both the strengths and limitations of state-of-the-art models, our work supports the development of clinically robust and globally applicable AI systems for healthcare.
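
To make the idea of "probabilistic predictions of seizure onset zones" concrete, here is a minimal sketch of how such an output might be normalized and scored against a ground-truth label. The region list, the function names, and the choice of top-1 accuracy and a Brier-style score are all assumptions made for illustration; the abstract does not specify the label set or the metrics the authors used.

```python
from typing import Dict

# Hypothetical region labels; the paper's actual label set is not stated here.
REGIONS = ["temporal", "frontal", "parietal", "occipital"]


def normalize(raw: Dict[str, float]) -> Dict[str, float]:
    """Turn per-region likelihoods (e.g. percentages) into a distribution over REGIONS."""
    clipped = {r: max(0.0, float(raw.get(r, 0.0))) for r in REGIONS}
    total = sum(clipped.values())
    if total == 0.0:
        # No usable signal: fall back to a uniform distribution.
        return {r: 1.0 / len(REGIONS) for r in REGIONS}
    return {r: v / total for r, v in clipped.items()}


def top1_correct(pred: Dict[str, float], true_region: str) -> bool:
    """True if the most probable region matches the ground-truth region."""
    return max(pred, key=pred.get) == true_region


def brier_score(pred: Dict[str, float], true_region: str) -> float:
    """Multi-class Brier score: squared error against a one-hot target (lower is better)."""
    return sum((p - (1.0 if r == true_region else 0.0)) ** 2 for r, p in pred.items())


# Example: a model assigns likelihoods to three of the four regions.
pred = normalize({"temporal": 60, "frontal": 30, "parietal": 10})
print(top1_correct(pred, "temporal"), round(brier_score(pred, "temporal"), 2))
```

A score of this kind rewards both picking the right region and being appropriately confident, which is one way to read "accurately and confidently" in the abstract.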

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for diagnostic reasoning from clinical narratives

Assessing LLM performance on seizure onset zone prediction in epilepsy

Identifying hallucination and citation issues in clinical LLM outputs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates LLMs on unstructured epilepsy narratives

Uses expert-guided chain-of-thought reasoning

Leverages a database of 1,269 seizure descriptions
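
The prompting ideas listed above (clinical role impersonation and expert-guided chain-of-thought) could be combined roughly as in the sketch below. The prompt wording, the `build_prompt` helper, and its parameters are hypothetical and are not the prompts used in the paper; the sketch only shows how the variants might be toggled independently for comparison.

```python
from typing import Optional


def build_prompt(
    description: str,
    role: Optional[str] = None,
    chain_of_thought: bool = False,
) -> str:
    """Assemble a prompt for one seizure description (illustrative wording only)."""
    parts = []
    if role:
        # In-context impersonation: ask the model to answer as a given clinical role.
        parts.append(f"You are an experienced {role}.")
    parts.append("Seizure description:")
    parts.append(description.strip())
    if chain_of_thought:
        # Chain-of-thought: request stepwise reasoning before the final answer.
        parts.append(
            "First list the semiological signs in the description, "
            "then explain step by step which brain regions they point to."
        )
    parts.append(
        "Finally, give a likelihood (in percent) for each candidate "
        "seizure onset zone, summing to 100."
    )
    return "\n".join(parts)


# Example: compare a plain prompt with a role + chain-of-thought prompt.
text = "Patient reports a rising epigastric sensation followed by lip smacking."
baseline = build_prompt(text)
enhanced = build_prompt(text, role="epileptologist", chain_of_thought=True)
print(enhanced)
```

Keeping each variation behind its own flag makes it straightforward to measure the effect of one factor at a time on the same set of descriptions.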

🔎 Similar Papers

Large Language Models for Disease Diagnosis: A Scoping Review