Benchmarking and Adapting On-Device LLMs for Clinical Decision Support

📅 2025-12-18

🤖 AI Summary

This study examines how large language models (LLMs) can be deployed efficiently in resource-constrained clinical settings while preserving patient privacy and removing the dependence on cloud infrastructure. We systematically evaluate several open-source, edge-deployable LLMs, including Gemma 4 31B and Qwen3.5-35B, on general and ophthalmology-specific diagnostic tasks, and improve their clinical adaptability through fine-tuning. The fine-tuned Qwen3.5-35B reaches an accuracy of 87.9%, close to the 89.4% of GPT-5.1. Across all models, 87.2% of diagnostic errors are clinically plausible differential diagnoses, and an upper-bound analysis indicates that improved answer selection could raise accuracy to as much as 93.2%. These findings support the feasibility of on-device LLMs for real-world clinical decision support.

📝 Abstract

Large language models (LLMs) have rapidly advanced in clinical decision-making, yet the deployment of proprietary systems is hindered by privacy concerns and reliance on cloud-based infrastructure. Open-source alternatives allow local inference but are often too large for resource-constrained clinical settings. Here, we benchmark on-device LLMs from the gpt-oss (20b, 120b), Qwen3.5 (9B, 27B, 35B), and Gemma 4 (31B) families across three representative clinical tasks: general disease diagnosis, specialty-specific (ophthalmology) diagnosis and management, and simulation of human expert grading and evaluation. We compare their performance with state-of-the-art proprietary models (GPT-5.1, GPT-5-mini, and Gemini 3.1 Pro) and a leading open-source model (DeepSeek-R1), and we further evaluate the adaptability of on-device systems by fine-tuning gpt-oss-20b and Qwen3.5-35B on general diagnostic data. Across tasks, on-device models achieve performance comparable to or exceeding that of DeepSeek-R1 and GPT-5-mini despite being substantially smaller. In addition, fine-tuning substantially improves diagnostic accuracy, with the fine-tuned Qwen3.5-35B reaching 87.9% and approaching the proprietary GPT-5.1 (89.4%). Among base on-device models, Gemma 4 31B achieves the strongest general diagnostic accuracy at 86.5%, exceeding GPT-5-mini and approaching the fine-tuned Qwen3.5-35B. Error characterization reveals that 87.2% of diagnostic errors across all models are clinically plausible differentials rather than off-topic predictions, and upper-bound analysis shows that up to 93.2% accuracy is attainable through improved answer selection. These findings highlight the potential of on-device LLMs to deliver accurate, adaptable, and privacy-preserving clinical decision support, offering a practical pathway for broader integration of LLMs into routine clinical practice.
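
The three headline quantities in the abstract (diagnostic accuracy, the share of errors that are plausible differentials, and the answer-selection upper bound) can be illustrated with a minimal Python sketch. The abstract does not define how these were computed, so everything here is an assumption: exact string matching after light normalization stands in for whatever grading the authors used, the upper bound is modeled as an oracle that picks the correct diagnosis whenever it appears among a model's candidate answers, and all function names and toy data are hypothetical.

```python
from typing import Sequence


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(label.lower().split())


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of cases whose selected answer matches the reference diagnosis."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("need equally sized, non-empty inputs")
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, gold))
    return hits / len(gold)


def oracle_upper_bound(
    candidates: Sequence[Sequence[str]], gold: Sequence[str]
) -> float:
    """Accuracy if an oracle selected the correct candidate whenever one exists.

    Assumption: this is one plausible reading of "upper-bound analysis through
    improved answer selection", not the paper's confirmed definition.
    """
    if len(candidates) != len(gold) or not gold:
        raise ValueError("need equally sized, non-empty inputs")
    hits = sum(
        normalize(g) in {normalize(c) for c in cands}
        for cands, g in zip(candidates, gold)
    )
    return hits / len(gold)


def plausible_error_share(
    is_correct: Sequence[bool], is_plausible: Sequence[bool]
) -> float:
    """Among incorrect answers, the fraction judged clinically plausible."""
    if len(is_correct) != len(is_plausible):
        raise ValueError("inputs must have the same length")
    errors = [plausible for ok, plausible in zip(is_correct, is_plausible) if not ok]
    if not errors:
        raise ValueError("no errors to characterize")
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Hypothetical toy data: four cases, each with a ranked candidate list.
    gold = ["glaucoma", "uveitis", "keratitis", "cataract"]
    candidates = [
        ["glaucoma", "uveitis"],
        ["keratitis", "uveitis"],
        ["cataract", "conjunctivitis"],
        ["cataract"],
    ]
    selected = [cands[0] for cands in candidates]  # model's top-ranked answer
    print("accuracy:", accuracy(selected, gold))
    print("oracle upper bound:", oracle_upper_bound(candidates, gold))
```

Under this reading, the gap between plain accuracy and the oracle bound measures how many errors come from choosing the wrong answer among candidates that already include the right one, rather than from the model never proposing the right diagnosis.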

Problem

Research questions and friction points this paper is trying to address.

on-device LLMs

clinical decision support

privacy-preserving AI

resource-constrained deployment

medical diagnosis

Innovation

Methods, ideas, or system contributions that make the work stand out.

on-device LLMs

clinical decision support

model fine-tuning