What Does Neuro Mean to Cardio? Investigating the Role of Clinical Specialty Data in Medical LLMs

📅 2025-05-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the impact of clinical specialty-specific data on the performance of medical large language models (LLMs), empirically testing the "specialty-data injection → specialty-capability enhancement" hypothesis in medical question answering. The authors construct S-MedQA, a fine-grained specialty-level QA benchmark, and conduct systematic evaluations via token-level probability analysis, cross-specialty generalization tests, and controlled fine-tuning experiments. Results reveal that specialty-specific fine-tuning does not consistently improve in-domain performance; instead, token probabilities of medically relevant terms increase uniformly across all specialties, indicating that performance gains stem primarily from domain-level transfer rather than genuine specialty knowledge injection. To the authors' knowledge, this is the first empirical study to refute the strong causal assumption that specialty data directly confers specialty capability. The work proposes a domain-transfer-dominant paradigm, challenging prevailing notions of knowledge injection in medical LLMs. The S-MedQA dataset and code are publicly released.

📝 Abstract
In this paper, we introduce S-MedQA, an English medical question-answering (QA) dataset for benchmarking large language models in fine-grained clinical specialties. We use S-MedQA to test the applicability of a popular knowledge-injection hypothesis in the knowledge-intensive scenario of medical QA, and show that: 1) training on data from a specialty does not necessarily lead to the best performance on that specialty, and 2) regardless of the specialty fine-tuned on, token probabilities of clinically relevant terms increase consistently across all specialties. We therefore believe the improvement gains come mostly from domain shifting (e.g., general to medical) rather than knowledge injection, and suggest rethinking the role of fine-tuning data in the medical domain. We release S-MedQA and all code needed to reproduce our experiments to the research community.
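The token-probability finding above can be illustrated with a minimal toy sketch. This is not the paper's actual pipeline: the vocabulary, logits, and term-to-token mapping below are invented for illustration, and in practice the per-step logits would come from a language model's output head before and after fine-tuning.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def term_probability(step_logits, term_token_ids):
    """Probability of a multi-token term: product of per-step token probabilities."""
    p = 1.0
    for logits, tid in zip(step_logits, term_token_ids):
        p *= softmax(logits)[tid]
    return p

# Toy vocabulary: index 0 = "stent" (cardiology), 1 = "seizure" (neurology),
# index 2 = a generic non-medical token. Logits are invented for illustration.
base  = [[1.0, 1.0, 3.0]]   # pretend logits before any fine-tuning
tuned = [[2.0, 2.0, 3.0]]   # pretend logits after fine-tuning on cardiology data

# The paper's observation, in miniature: after specialty fine-tuning, the
# probabilities of clinically relevant terms rise for ALL specialties,
# not only the one trained on.
assert term_probability(tuned, [0]) > term_probability(base, [0])  # in-specialty term
assert term_probability(tuned, [1]) > term_probability(base, [1])  # out-of-specialty term
```

If the gains came from genuine specialty knowledge injection, one would expect only the in-specialty term's probability to rise; the uniform increase is what the authors attribute to domain-level transfer.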
Problem

Research questions and friction points this paper is trying to address.

Evaluating medical LLM performance across clinical specialties
Assessing impact of specialty data on knowledge injection
Challenging assumptions about fine-tuning data in medicine
Innovation

Methods, ideas, or system contributions that make the work stand out.

Specialty dataset S-MedQA for medical QA benchmarking
Testing knowledge injection hypothesis in medical LLMs
Domain shifting improves performance more than knowledge injection