The Limited Impact of Medical Adaptation of Large Language and Vision-Language Models

πŸ“… 2024-11-13
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Prior work often assumes domain-specific pretraining inherently improves large language models (LLMs) and vision-language models (VLMs) for medical question answering, yet rigorous empirical validation remains scarce. Method: We systematically evaluate medically adapted LLMs and VLMs on biomedical QA tasks, continuing pretraining on public biomedical corpora and employing paired baseline comparisons, independent prompt optimization, and strict statistical significance testing (e.g., paired t-tests). Contribution/Results: Our analysis reveals that the marginal gains of medical pretraining are widely overestimated: in 3-shot clinical note QA, only 26.7% of medical LLMs significantly outperform their base counterparts, while 56.7% perform significantly worse; no medical VLM achieves consistent cross-dataset improvement. We introduce a reproducible attribution evaluation framework that challenges the implicit β€œdomain pretraining always helps” assumption, providing methodological caution and an empirically grounded benchmark for medical foundation model development.

πŸ“ Abstract
Several recent works seek to adapt general-purpose large language models (LLMs) and vision-language models (VLMs) for medical applications through continued pretraining on publicly available biomedical corpora. These works typically claim that such domain-adaptive pretraining improves performance on various downstream medical tasks, such as answering medical exam questions. In this paper, we compare ten "medical" LLMs and two VLMs against their corresponding base models, arriving at a different conclusion: all medical VLMs and nearly all medical LLMs fail to consistently improve over their base models in the zero-/few-shot prompting and supervised fine-tuning regimes for medical question answering (QA). For instance, on clinical-note-based QA tasks in the 3-shot setting, medical LLMs outperform their base models in only 26.7% of cases, reach a (statistical) tie in 16.7% of cases, and perform significantly worse in the remaining 56.7% of cases. Our conclusions are based on (i) comparing each medical model directly against its base model; (ii) optimizing the prompts for each model separately in zero-/few-shot prompting; and (iii) accounting for statistical uncertainty in comparisons. Our findings suggest that state-of-the-art general-domain models may already exhibit strong medical knowledge and reasoning capabilities, and offer recommendations to strengthen the conclusions of future studies.
Problem

Research questions and friction points this paper is trying to address.

Does continued pretraining on biomedical corpora actually improve LLMs and VLMs for medical tasks?
Do "medical" models outperform their general-purpose base models under fair, paired comparisons?
How should performance on medical QA tasks be measured to support claims of improvement?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compares each medical LLM and VLM directly against its corresponding base model
Evaluates both zero-/few-shot prompting (with prompts optimized per model) and supervised fine-tuning regimes
Accounts for statistical uncertainty when comparing performance on medical QA tasks
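The paired-comparison idea above can be sketched as follows. This is a minimal illustration, not the paper's actual test suite: the function name, the sign-flip permutation procedure, and the toy correctness data are all assumptions for demonstration, standing in for whatever paired significance test (e.g., a paired t-test) the authors used.

```python
import random

def paired_sign_flip_test(med_correct, base_correct, n_resamples=10_000, seed=0):
    """Paired permutation (sign-flip) test on per-question score differences.

    med_correct / base_correct: 0/1 correctness of the medical and base model
    on the *same* questions. Returns (mean accuracy difference, two-sided p-value).
    """
    rng = random.Random(seed)
    diffs = [m - b for m, b in zip(med_correct, base_correct)]
    observed = sum(diffs) / len(diffs)
    # Under the null (no difference), each per-question difference is
    # equally likely to carry either sign; flip signs at random and see
    # how often the resampled statistic is at least as extreme.
    count = 0
    for _ in range(n_resamples):
        stat = sum(d * rng.choice((1, -1)) for d in diffs) / len(diffs)
        if abs(stat) >= abs(observed):
            count += 1
    return observed, (count + 1) / (n_resamples + 1)

# Hypothetical per-question correctness for ten QA items:
med = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
base = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]
diff, p = paired_sign_flip_test(med, base)
```

The key design choice is pairing: because both models answer the same questions, per-question differences cancel out question difficulty, which is exactly why unpaired accuracy comparisons (common in prior work, per the paper) can overstate gains.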
Daniel P. Jeong
Machine Learning Department, Carnegie Mellon University
Pranav Mani
Abridge AI
Saurabh Garg
Mistral AI
Zachary C. Lipton
Raj Reddy Associate Professor of Machine Learning @ Carnegie Mellon; Cofounder & CTO @ Abridge
Machine Learning · Healthcare · Technology & Society · NLP · Robustness & Adaptivity
Michael Oberst
Department of Computer Science, Johns Hopkins University