🤖 AI Summary
Deploying large language models (LLMs) in emergency departments (EDs) faces critical bottlenecks, including constrained hardware resources, high operational costs, and heightened patient privacy risks. Method: This study investigates the viability of small language models (SLMs) for clinical decision support by constructing an ED-oriented benchmark from MedMCQA, MedQA-4Options, and PubMedQA, and systematically evaluating both general-purpose and medically fine-tuned SLMs on information-integration and rapid-reasoning tasks. Contribution/Results: Counterintuitively, general-purpose SLMs without medical domain fine-tuning significantly outperform specialized medically fine-tuned models across multiple ED simulation tasks, challenging the prevailing assumption that clinical AI requires domain-specific adaptation. These findings provide empirical evidence and a practical pathway for deploying lightweight, low-latency, privacy-preserving, and cost-efficient edge-based clinical AI systems in resource-constrained ED settings.
📝 Abstract
Large language models (LLMs) have become increasingly popular in medical domains, assisting physicians with a variety of clinical and operational tasks. Given the fast-paced, high-stakes environment of emergency departments (EDs), small language models (SLMs), which have far fewer parameters than LLMs, offer significant potential: they retain strong reasoning capability while running efficiently, enabling them to support physicians with timely and accurate information synthesis that improves clinical decision-making and workflow efficiency. In this paper, we present a comprehensive benchmark designed to identify SLMs suited for ED decision support, accounting for both specialized medical expertise and broad general problem-solving ability. Our evaluations focus on SLMs trained on a mixture of general-domain and medical corpora. A key motivation for emphasizing SLMs is the hardware limitations, operational cost constraints, and privacy concerns of typical real-world deployments. Our benchmark comprises MedMCQA, MedQA-4Options, and PubMedQA, with the medical-abstracts dataset emulating tasks aligned with real ED physicians' daily work. Experimental results reveal that general-domain SLMs surprisingly outperform their medically fine-tuned counterparts across these diverse ED benchmarks, indicating that specialized medical fine-tuning may not be required for ED decision support.
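The benchmarks named above (MedMCQA, MedQA-4Options) are multiple-choice QA datasets, so the core evaluation reduces to scoring a model's option selections against gold answers. The sketch below illustrates that loop under stated assumptions: the `pick_answer` stub and the sample items are hypothetical placeholders, not the paper's models or data.

```python
# Minimal sketch of a multiple-choice evaluation loop in the style of
# MedMCQA / MedQA-4Options. The model stub and items below are
# hypothetical placeholders, not the paper's actual models or data.

def pick_answer(question: str, options: dict[str, str]) -> str:
    """Stand-in for an SLM call; a real run would query the model here."""
    # Hypothetical heuristic for illustration: always choose option "A".
    return "A"

def evaluate(items: list[dict]) -> float:
    """Return accuracy of pick_answer over 4-option QA items."""
    correct = sum(
        pick_answer(it["question"], it["options"]) == it["answer"]
        for it in items
    )
    return correct / len(items)

# Toy items showing the assumed record shape (hypothetical content).
sample_items = [
    {"question": "First-line treatment for anaphylaxis?",
     "options": {"A": "Epinephrine", "B": "Aspirin",
                 "C": "Warfarin", "D": "Insulin"},
     "answer": "A"},
    {"question": "Reversal agent for opioid overdose?",
     "options": {"A": "Flumazenil", "B": "Naloxone",
                 "C": "Atropine", "D": "Protamine"},
     "answer": "B"},
]

print(evaluate(sample_items))  # 0.5 with the always-"A" stub
```

Swapping the stub for a real model call (and the toy items for the benchmark records) turns this into the accuracy comparison the paper reports between general-domain and medically fine-tuned SLMs.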