Small or Large? Zero-Shot or Finetuned? Guiding Language Model Choice for Specialized Applications in Healthcare

📅 2025-04-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the classification of electronic pathology reports across three scenarios of varying difficulty and data size. Methodologically, it systematically investigates language model selection and adaptation strategies, comparing small language models (SLMs) and large language models (LLMs) under zero-shot and finetuning paradigms, and evaluating the impact of domain-adjacent pretraining, further domain-specific pretraining, and supervised finetuning on these data-scarce, high-difficulty tasks. Results show that finetuned SLMs substantially outperform both zero-shot SLMs and the zero-shot LLM; domain-adjacent pretraining further improves performance, while domain-specific pretraining yields particularly notable gains on the complex, data-scarce task. Crucially, SLMs achieve a superior trade-off among accuracy, inference efficiency, and deployment cost. The study contributes a reproducible model-selection framework and a lightweight adaptation pipeline tailored to specialized medical NLP tasks.

📝 Abstract
This study aims to guide language model selection by investigating: 1) the necessity of finetuning versus zero-shot usage, 2) the benefits of domain-adjacent versus generic pretrained models, 3) the value of further domain-specific pretraining, and 4) the continued relevance of Small Language Models (SLMs) compared to Large Language Models (LLMs) for specific tasks. Using electronic pathology reports from the British Columbia Cancer Registry (BCCR), three classification scenarios with varying difficulty and data size are evaluated. Models include various SLMs and an LLM. SLMs are evaluated both zero-shot and finetuned; the LLM is evaluated zero-shot only. Finetuning significantly improved SLM performance across all scenarios compared to their zero-shot results. The zero-shot LLM outperformed zero-shot SLMs but was consistently outperformed by finetuned SLMs. Domain-adjacent SLMs generally performed better than the generic SLM after finetuning, especially on harder tasks. Further domain-specific pretraining yielded modest gains on easier tasks but significant improvements on the complex, data-scarce task. The results highlight the critical role of finetuning for SLMs in specialized domains, enabling them to surpass zero-shot LLM performance on targeted classification tasks. Pretraining on domain-adjacent or domain-specific data provides further advantages, particularly for complex problems or limited finetuning data. While LLMs offer strong zero-shot capabilities, their performance on these specific tasks did not match that of appropriately finetuned SLMs. In the era of LLMs, SLMs remain relevant and effective, offering a potentially superior performance-resource trade-off compared to LLMs.
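The study's central comparison — near-chance zero-shot performance versus strong performance after supervised finetuning on labeled reports — can be illustrated with a minimal PyTorch sketch. This is not the authors' pipeline: the tiny embedding-plus-linear-head model, the synthetic "reports" (token bands standing in for class-specific terminology), and all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a small language model: an embedding layer with a
# mean-pooled representation feeding a 3-class linear head.
class TinyClassifier(nn.Module):
    def __init__(self, vocab_size=100, dim=32, num_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, token_ids):  # token_ids: (batch, seq_len)
        pooled = self.embed(token_ids).mean(dim=1)
        return self.head(pooled)

# Synthetic "reports": each class draws tokens from a distinct vocabulary
# band, loosely mimicking class-specific terminology in pathology reports.
def make_batch(n=90, seq_len=12):
    ids, labels = [], []
    for i in range(n):
        cls = i % 3
        ids.append(torch.randint(cls * 30, cls * 30 + 30, (seq_len,)))
        labels.append(cls)
    return torch.stack(ids), torch.tensor(labels)

x, y = make_batch()
model = TinyClassifier()

# "Zero-shot" accuracy: the untrained head performs near chance (~1/3).
with torch.no_grad():
    zero_shot_acc = (model(x).argmax(1) == y).float().mean().item()

# Finetuning: supervised training on the labeled reports.
opt = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()
for _ in range(100):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

with torch.no_grad():
    finetuned_acc = (model(x).argmax(1) == y).float().mean().item()

print(f"zero-shot: {zero_shot_acc:.2f}, finetuned: {finetuned_acc:.2f}")
```

On this separable toy data the finetuned model reaches high accuracy while the untrained one does not — a caricature of the paper's finding that task-specific supervision, not model scale, drives performance on specialized classification tasks.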
Problem

Research questions and friction points this paper is trying to address.

Guide language model choice for healthcare applications
Compare finetuned vs zero-shot performance of SLMs and LLMs
Evaluate domain-specific pretraining benefits for specialized tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Finetuned SLMs surpass zero-shot LLMs.
Domain-adjacent pretraining boosts SLM performance.
SLMs offer better performance-resource trade-off.
Lovedeep Gondara
British Columbia Cancer Registry, Provincial Health Services Authority, Vancouver, Canada
Jonathan Simkin
Director, BC Cancer Registry
Epidemiology, Machine Learning, Natural Language Processing
Graham Sayle
Data Science Institute, University of British Columbia, Vancouver, Canada
Shebnum Devji
British Columbia Cancer Registry, Provincial Health Services Authority, Vancouver, Canada
Gregory Arbour
Data Science Institute, University of British Columbia, Vancouver, Canada
Raymond Ng
University of British Columbia
data mining, health informatics, genomics, NLP, text mining