Generalist Foundation Models Are Not Clinical Enough for Hospital Operations

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
General-purpose large language models lack the clinical domain expertise required for hospital operations decision-making (e.g., patient flow optimization, cost containment, and care quality improvement). Method: The authors introduce Lang1, a family of medical operations–specific language models (100M–7B parameters) pretrained on a blend of electronic health record (EHR) text and web data, and propose ReMedE, a real-world evaluation benchmark derived from EHR notes covering five core tasks: 30-day readmission, 30-day mortality, length of stay, comorbidity coding, and insurance claims denial. Lang1 combines hybrid pretraining on clinical and web text with supervised fine-tuning and multi-task learning, and is evaluated under both zero-shot and fine-tuned settings. Contribution/Results: Domain-specific pretraining proves critical: after fine-tuning, Lang1-1B outperforms fine-tuned general-purpose models up to 70× larger (AUROC gains of 3.64–6.75% across tasks) and zero-shot models up to 671× larger, and generalizes to an external health system. These results support the efficacy of the “domain pretraining + task-adaptive fine-tuning” paradigm for AI-driven healthcare operations.

📝 Abstract
Hospitals and healthcare systems rely on operational decisions that determine patient flow, cost, and quality of care. Despite strong performance on medical knowledge and conversational benchmarks, foundation models trained on general text may lack the specialized knowledge required for these operational decisions. We introduce Lang1, a family of models (100M-7B parameters) pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs and 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes that evaluates five critical tasks: 30-day readmission prediction, 30-day mortality prediction, length of stay, comorbidity coding, and predicting insurance claims denial. In zero-shot settings, both general-purpose and specialized models underperform on four of five tasks (36.6%-71.7% AUROC), with mortality prediction being an exception. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively. We also observed cross-task scaling with joint finetuning on multiple tasks leading to improvement on other tasks. Lang1-1B effectively transfers to out-of-distribution settings, including other clinical tasks and an external health system. Our findings suggest that predictive capabilities for hospital operations require explicit supervised finetuning, and that this finetuning process is made more efficient by in-domain pretraining on EHR. Our findings support the emerging view that specialized LLMs can compete with generalist models in specialized tasks, and show that effective healthcare systems AI requires the combination of in-domain pretraining, supervised finetuning, and real-world evaluation beyond proxy benchmarks.
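Since ReMedE reports results as AUROC across its prediction tasks, a minimal sketch of the metric may help: AUROC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The labels and scores below are illustrative toy values, not data from the paper.

```python
def auroc(labels, scores):
    """Compute AUROC via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy 30-day readmission example: 1 = readmitted, 0 = not readmitted.
y_true = [1, 0, 1, 0, 0, 1]
y_score = [0.9, 0.2, 0.7, 0.8, 0.1, 0.6]
print(auroc(y_true, y_score))  # 7/9 ≈ 0.778
```

A 0.5 AUROC corresponds to random ranking, which is why the zero-shot results in the 36.6%–71.7% range indicate near-chance performance on most tasks.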
Problem

Research questions and friction points this paper is trying to address.

General foundation models lack specialized knowledge for hospital operational decisions
Existing models underperform on critical clinical tasks like readmission and mortality prediction
Healthcare AI requires domain-specific pretraining and real-world evaluation beyond benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pretrained models on specialized EHR and internet corpus
Developed realistic medical evaluation benchmark ReMedE
Combined in-domain pretraining with supervised finetuning
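A back-of-the-envelope look at the pretraining blend, assuming uniform token sampling: the token counts (80B clinical from NYU Langone EHRs, 627B from the internet) come from the abstract, while the derived shares are illustrative.

```python
# Corpus blend described in the abstract; sampling strategy is an assumption.
clinical_tokens = 80e9   # EHR tokens from NYU Langone Health
web_tokens = 627e9       # internet tokens

total = clinical_tokens + web_tokens
clinical_share = clinical_tokens / total
web_share = web_tokens / total

print(f"clinical share: {clinical_share:.1%}")  # ~11.3%
print(f"web share: {web_share:.1%}")            # ~88.7%
```

Even at roughly one-ninth of the mix, the in-domain clinical tokens are what the paper credits for the fine-tuning efficiency gains.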
Lavender Y. Jiang
Courant Institute School of Mathematics, Computing, and Data Science, New York University, 60 5th Ave, New York, 10001, NY, USA.
Angelica Chen
New York University
NLP, deep learning
Xu Han
Department of Neurosurgery, NYU Langone Health, 550 First Avenue, New York, 10016, NY, USA.
Xujin Chris Liu
Department of Neurosurgery, NYU Langone Health, 550 First Avenue, New York, 10016, NY, USA.
Radhika Dua
Courant Institute School of Mathematics, Computing, and Data Science, New York University, 60 5th Ave, New York, 10001, NY, USA.
Kevin Eaton
Grossman School of Medicine, New York University, 550 First Avenue, New York, 10016, NY, USA.
Frederick Wolff
Department of Neurosurgery, NYU Langone Health, 550 First Avenue, New York, 10016, NY, USA.
Robert Steele
Department of Neurosurgery, NYU Langone Health, 550 First Avenue, New York, 10016, NY, USA.
Jeff Zhang
Division of Applied AI Technologies, NYU Langone Health, 227 East 30th Street, New York, 10016, NY, USA.
Anton Alyakin
Medical student at Washington University
LLMs, neurosurgery, networks, causality
Qingkai Pan
Department of Neurosurgery, NYU Langone Health, 550 First Avenue, New York, 10016, NY, USA.
Yanbing Chen
Department of Neurosurgery, NYU Langone Health, 550 First Avenue, New York, 10016, NY, USA.
Karl L. Sangwon
Medical Student at NYU Grossman School of Medicine
Neurosurgery, Applied Math
Daniel A. Alber
Department of Neurosurgery, NYU Langone Health, 550 First Avenue, New York, 10016, NY, USA.
Jaden Stryker
Department of Neurosurgery, NYU Langone Health, 550 First Avenue, New York, 10016, NY, USA.
Jin Vivian Lee
Department of Neurosurgery, NYU Langone Health, 550 First Avenue, New York, 10016, NY, USA.
Yindalon Aphinyanaphongs
Department of Medicine, NYU Langone Health, 550 First Avenue, New York, 10019, NY, USA.
Kyunghyun Cho
New York University, Genentech
Machine Learning, Deep Learning
Eric Karl Oermann
New York University
Artificial Intelligence, Human Intelligence