🤖 AI Summary
This work investigates how open-source large language models (LLMs) acquire clinical knowledge from general pretraining corpora, focusing on two core challenges: comprehension of clinical jargon and reliability when responding to unsupported medical claims. Methodologically, it first identifies a distributional mismatch between clinical term frequencies in pretraining data and their usage patterns in real-world electronic health records (EHRs); it then introduces MedLingo, a new benchmark for clinical terminology understanding, and combines corpus frequency analysis, source categorization, and correlation studies linking pretraining data composition to model outputs. Key contributions include: (1) empirical evidence that a term's frequency in pretraining corpora correlates with model performance on it; (2) the finding that jargon common in real-world EHRs is often sparse in general-purpose corpora; and (3) a classification of the online sources, such as patient forums and non-expert health blogs, in which clinical jargon and unsupported medical claims appear. These findings provide an evidence-based foundation and methodological framework for data provenance analysis, knowledge bottleneck diagnosis, and trustworthiness improvements for clinical LLMs.
📝 Abstract
Large language models (LLMs) have performed well across various clinical natural language processing tasks, despite not being directly trained on electronic health record (EHR) data. In this work, we examine how popular open-source LLMs learn clinical information from large mined corpora through two crucial but understudied lenses: (1) their interpretation of clinical jargon, a foundational ability for understanding real-world clinical notes, and (2) their responses to unsupported medical claims. For both use cases, we investigate the frequency of relevant clinical information in their corresponding pretraining corpora, the relationship between pretraining data composition and model outputs, and the sources underlying this data. To isolate clinical jargon understanding, we evaluate LLMs on a new dataset, MedLingo. Unsurprisingly, we find that the frequency of clinical jargon mentions across major pretraining corpora correlates with model performance. However, jargon that frequently appears in clinical notes often rarely appears in pretraining corpora, revealing a mismatch between available data and real-world usage. Similarly, we find that a non-negligible portion of documents support disputed claims that can then be parroted by models. Finally, we classify and analyze the types of online sources in which clinical jargon and unsupported medical claims appear, with implications for future dataset composition.
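The frequency–performance correlation described above can be sketched in miniature as follows. Everything here is a hypothetical stand-in: the toy "corpus," the jargon terms, and the per-term accuracy numbers are invented for illustration, and Spearman rank correlation is one plausible choice of statistic rather than the paper's documented method.

```python
def count_mentions(corpus_docs, terms):
    """Count how many pretraining documents mention each clinical term."""
    return {t: sum(1 for d in corpus_docs if t in d.lower()) for t in terms}

def spearman(xs, ys):
    """Spearman rank correlation (no tie correction; adequate for toy data)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical jargon terms, a tiny stand-in "pretraining corpus,"
# and made-up per-term model accuracies (e.g., from a MedLingo-style probe).
terms = ["sob", "ambulating", "qhs", "diaphoresis"]
docs = [
    "patient reports sob on exertion",
    "sob resolved after nebulizer treatment",
    "ambulating independently, sob absent",
    "take medication qhs as directed",
]
freq = count_mentions(docs, terms)
accuracy = {"sob": 0.9, "ambulating": 0.6, "qhs": 0.5, "diaphoresis": 0.2}

# A positive rho would indicate that terms seen more often during
# pretraining tend to be interpreted more accurately.
rho = spearman([freq[t] for t in terms], [accuracy[t] for t in terms])
```

At real scale the frequency counts would come from indexing a web-scale corpus rather than substring matching over a handful of strings, but the shape of the analysis is the same: one frequency and one score per term, then a rank correlation across terms.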