Delving into LLM-assisted writing in biomedical publications through excess vocabulary

📅 2024-06-11
📈 Citations: 16
Influential: 1
🤖 AI Summary
This study quantifies how far large language models (LLMs) have penetrated biomedical academic writing. Method: Using over 15 million PubMed abstracts published between 2010 and 2024, the authors apply an unsupervised, large-scale analysis of vocabulary changes. It identifies “excess words”, style words whose frequency rose abruptly after LLMs appeared, and allows comparison across time, disciplines, countries, and journals. Contribution/Results: The excess word analysis suggests that at least 13.5% of 2024 abstracts were processed with LLMs, a lower bound that reaches 40% for some subcorpora. The impact of LLMs on scientific writing in biomedical research is unprecedented and surpasses the effect of major world events such as the COVID-19 pandemic. These estimates offer a reference point for discussions of publishing ethics, quality assurance, and science policy.

📝 Abstract
Large language models (LLMs) like ChatGPT can generate and revise text with human-level performance. These models come with clear limitations: they can produce inaccurate information, reinforce existing biases, and be easily misused. Yet many scientists use them for their scholarly writing. But how widespread is such LLM usage in the academic literature? To answer this question for the field of biomedical research, we present an unbiased, large-scale approach: we study vocabulary changes in over 15 million biomedical abstracts from 2010 to 2024 indexed by PubMed, and show how the appearance of LLMs led to an abrupt increase in the frequency of certain style words. This excess word analysis suggests that at least 13.5% of 2024 abstracts were processed with LLMs. This lower bound differed across disciplines, countries, and journals, reaching 40% for some subcorpora. We show that LLMs have had an unprecedented impact on scientific writing in biomedical research, surpassing the effect of major world events such as the COVID-19 pandemic.
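
The excess word idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact formula, so everything here is an assumption: the expected 2024 frequency is taken to be a linear extrapolation from two pre-LLM years (2021 and 2022), the function names `expected_frequency` and `excess_usage` are hypothetical, and the toy frequencies are invented for illustration rather than taken from the paper.

```python
def expected_frequency(freq_by_year, base_years=(2021, 2022), target_year=2024):
    """Project a word's frequency to target_year from two pre-LLM years.

    freq_by_year maps a year to the fraction of abstracts containing the word.
    ASSUMPTION: linear extrapolation; the paper's exact counterfactual may differ.
    """
    y0, y1 = base_years
    slope = (freq_by_year[y1] - freq_by_year[y0]) / (y1 - y0)
    projected = freq_by_year[y1] + slope * (target_year - y1)
    return max(projected, 0.0)  # a frequency cannot be negative


def excess_usage(freq_by_year, target_year=2024):
    """Return (gap, ratio) between observed and expected frequency."""
    observed = freq_by_year[target_year]
    expected = expected_frequency(freq_by_year, target_year=target_year)
    gap = observed - expected  # extra fraction of abstracts using the word
    ratio = observed / expected if expected > 0 else float("inf")
    return gap, ratio


# HYPOTHETICAL toy frequencies for one style word (not data from the paper).
toy_frequencies = {2021: 0.010, 2022: 0.011, 2024: 0.050}
gap, ratio = excess_usage(toy_frequencies)
print(f"excess gap: {gap:.3f}, excess ratio: {ratio:.2f}")
```

Under these assumptions the gap reads as a lower bound: if a word appears in a larger fraction of abstracts than the pre-LLM trend predicts, at least that extra fraction of abstracts was plausibly affected by the new factor. It is only a lower bound because LLM-processed abstracts that avoid the word are not counted.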

Problem

Research questions and friction points this paper is trying to address.

Impact of LLMs on biomedical writing
Detection of LLM usage in abstracts
Analysis of vocabulary changes post-LLM introduction

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-assisted biomedical writing analysis
Excess vocabulary tracking method
Large-scale PubMed abstract study

Dmitry Kobak
University of Tübingen
Machine Learning, Unsupervised Learning, Manifold learning, Transcriptomics, Computational Neuroscience
Rita González-Márquez
Hertie Institute for AI in Brain Health, University of Tübingen, Germany; Tübingen AI Center, Tübingen, Germany
Emőke-Ágnes Horvát
Associate Professor, Northwestern University
Computational Social Science, Science of Science, Complex Networks, Human-Centered Computing
Jan Lause
Hertie Institute for AI in Brain Health, University of Tübingen, Germany; Tübingen AI Center, Tübingen, Germany