🤖 AI Summary
This study addresses the challenge of extracting clinically actionable tasks from discharge summaries, which is made difficult by textual complexity and inconsistent annotations. The authors propose a two-stage prompting framework that decomposes narrative text into fine-grained, explicitly actionable tasks, and they present a systematic comparison between general-purpose large language models (LLMs) and specialized supervised BERT-based models on this task. Experimental results show that LLMs match or surpass supervised models in binary actionability detection but still lag behind in fine-grained multi-label classification. Qualitative analysis reveals that many model failures stem from misalignment between model reasoning and annotation conventions, and that labels without explicit rationales cannot separate such mismatches from genuine reasoning failures. This highlights the need for datasets annotated with reasoning chains to support proper evaluation of clinical understanding.
📝 Abstract
This paper evaluates zero-shot and few-shot large language models (LLMs) for safety-critical clinical action extraction using the CLIP discharge-note dataset, with particular emphasis on transitions of care and post-discharge patient safety. To manage the complexity of clinical documentation, we introduce a two-stage extraction framework that uses a staged prompting strategy to decompose narrative discharge notes into fine-grained, explicitly actionable clinical tasks. Our contributions include a systematic assessment of generative LLMs for clinical action extraction, a detailed comparison between general-purpose LLMs and task-specific supervised BERT-based models, and an analysis of annotation inconsistencies across action categories. We show that contemporary LLMs, without task-specific fine-tuning and under strict data-privacy constraints, achieve performance comparable to or exceeding that of supervised models on binary actionability detection, while supervised baselines retain a meaningful advantage on fine-grained multi-label category classification. Qualitative error analysis reveals that many failures stem from misalignment between model reasoning and dataset annotation conventions, particularly in cases involving implicit clinical actions and rigid structural labeling rules. These results indicate that reported performance reflects model limitations arising from a lack of clinical reasoning that plain annotations do not capture. Labels without rationales make it impossible to distinguish clinical reasoning failures from annotation convention mismatches. Advancing clinical NLP requires reasoning-annotated datasets that document why specific spans are actionable, not merely which spans were labeled, enabling proper evaluation of model clinical understanding.
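As a rough illustration of what a two-stage, staged-prompting pipeline could look like, the Python sketch below first asks a model whether each sentence is actionable (the binary task) and then asks for categories only on the sentences that passed (the multi-label task). The abstract does not specify what each stage does, so this split is one plausible reading and not the authors' implementation. The prompt wording, the `CATEGORIES` label names, the `LLM` callable interface, and the keyword-based `toy_llm` stand-in are all assumptions introduced here for illustration.

```python
import re
from typing import Callable, Dict, List

# Illustrative label names only; the abstract does not enumerate the categories.
CATEGORIES = ["appointment", "medication", "lab", "procedure",
              "imaging", "patient_instruction", "other"]

# Hypothetical model interface: prompt text in, raw reply text out.
LLM = Callable[[str], str]


def split_sentences(note: str) -> List[str]:
    """Naive splitter; a real pipeline would use a clinical sentence tokenizer."""
    text = " ".join(note.split())  # collapse newlines and repeated spaces
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def stage_one(sentences: List[str], llm: LLM) -> List[str]:
    """Stage 1 (binary): keep only sentences the model judges actionable."""
    kept = []
    for sentence in sentences:
        reply = llm("Does this sentence require a follow-up action? "
                    f"Answer yes or no.\n{sentence}")
        if reply.strip().lower().startswith("yes"):
            kept.append(sentence)
    return kept


def stage_two(sentences: List[str], llm: LLM) -> Dict[str, List[str]]:
    """Stage 2 (multi-label): assign zero or more categories per sentence."""
    results: Dict[str, List[str]] = {}
    for sentence in sentences:
        reply = llm(f"Which categories apply? Options: {', '.join(CATEGORIES)}. "
                    f"Answer with a comma-separated list.\n{sentence}")
        picked = {part.strip().lower() for part in reply.split(",")}
        # Discard anything the model returns that is not a known label.
        results[sentence] = [c for c in CATEGORIES if c in picked]
    return results


def extract_actions(note: str, llm: LLM) -> Dict[str, List[str]]:
    """Run both stages in sequence over one discharge note."""
    return stage_two(stage_one(split_sentences(note), llm), llm)


def toy_llm(prompt: str) -> str:
    """Keyword stand-in for a real model, used only to make the sketch runnable."""
    sentence = prompt.splitlines()[-1].lower()
    if prompt.startswith("Does"):
        hit = "follow up" in sentence or "continue" in sentence
        return "yes" if hit else "no"
    labels = []
    if "follow up" in sentence:
        labels.append("appointment")
    if "continue" in sentence:
        labels.append("medication")
    return ", ".join(labels)
```

Calling `extract_actions(note, toy_llm)` on a short note returns a mapping from each actionable sentence to its predicted labels. Keeping the stages separate also makes it possible to score binary detection and multi-label classification independently, which mirrors the two evaluations contrasted in the abstract.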