🤖 AI Summary
Prior work often assumes domain-specific pretraining inherently improves large language models (LLMs) and vision-language models (VLMs) for medical question answering, yet rigorous empirical validation remains scarce.
Method: We systematically evaluate medically adapted LLMs and VLMs on biomedical QA tasks, continuing pretraining on public biomedical corpora and employing paired baseline comparisons, independent prompt optimization, and strict statistical significance testing (e.g., paired t-tests).
Contribution/Results: Our analysis reveals that the marginal gains of medical pretraining are widely overestimated: in 3-shot clinical note QA, only 26.7% of medical LLMs significantly outperform their base counterparts, while 56.7% perform significantly worse; no medical VLM achieves consistent cross-dataset improvement. We introduce a reproducible attribution evaluation framework that challenges the implicit "domain pretraining always helps" assumption, providing methodological caution and an empirically grounded benchmark for medical foundation model development.
📝 Abstract
Several recent works seek to adapt general-purpose large language models (LLMs) and vision-language models (VLMs) for medical applications through continued pretraining on publicly available biomedical corpora. These works typically claim that such domain-adaptive pretraining improves performance on various downstream medical tasks, such as answering medical exam questions. In this paper, we compare ten "medical" LLMs and two VLMs against their corresponding base models, arriving at a different conclusion: all medical VLMs and nearly all medical LLMs fail to consistently improve over their base models in the zero-/few-shot prompting and supervised fine-tuning regimes for medical question answering (QA). For instance, on clinical-note-based QA tasks in the 3-shot setting, medical LLMs outperform their base models in only 26.7% of cases, reach a (statistical) tie in 16.7% of cases, and perform significantly worse in the remaining 56.7% of cases. Our conclusions rest on (i) comparing each medical model directly against its base model; (ii) optimizing prompts for each model separately in zero-/few-shot prompting; and (iii) accounting for statistical uncertainty in the comparisons. Our findings suggest that state-of-the-art general-domain models may already exhibit strong medical knowledge and reasoning capabilities, and we offer recommendations to strengthen the conclusions of future studies.
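The win/tie/loss accounting above hinges on a paired significance test over per-question outcomes. As a rough illustration (not the paper's actual code or data), the sketch below compares a base model and its medically adapted counterpart on the same QA items with a paired sign-flip permutation test, a nonparametric analogue of the paired t-test mentioned in the summary; all scores are made up for the example.

```python
import random
from statistics import mean

def paired_permutation_test(base, adapted, n_perm=10_000, seed=0):
    """Two-sided p-value for the mean per-question accuracy difference.

    base, adapted: parallel lists of 0/1 correctness on the SAME items,
    so each question serves as its own control (a paired comparison).
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(adapted, base)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_perm):
        # Under the null, each model is equally likely to win on an item,
        # so the sign of every paired difference can be flipped at random.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    return hits / n_perm

# Illustrative per-question scores (1 = correct) on 12 shared items.
base    = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
adapted = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
p = paired_permutation_test(base, adapted)
```

A large p-value here would be recorded as a (statistical) tie between the two models rather than a win or loss for the adapted one; only pairs with small p-values contribute to the 26.7% / 56.7% figures.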