Benchmarking large language models for biomedical natural language processing applications and recommendations

📅 2023-05-10
🏛️ Nature Communications
📈 Citations: 41
Influential: 1
🤖 AI Summary
The applicability of large language models (LLMs) in biomedical natural language processing (BioNLP) remains insufficiently systematized. Method: We conduct a comprehensive benchmarking study evaluating four LLMs from the GPT and LLaMA families across 12 BioNLP benchmarks (e.g., MedQA, BC5CDR) under zero-shot, few-shot, and supervised fine-tuning settings, contrasting them against traditional BERT/BART fine-tuning. We further integrate hallucination detection, information-omission analysis, and computational cost modeling. Contribution/Results: GPT-4 achieves superior performance on medical reasoning tasks; open-source LLMs, when fine-tuned, approach the performance of closed-source counterparts; yet traditional fine-tuning still outperforms zero- and few-shot LLMs on most tasks. Our analysis delineates the performance boundaries, critical failure modes (e.g., hallucination, knowledge gaps), and cost-accuracy trade-offs of LLMs in BioNLP, yielding actionable, clinically oriented deployment guidelines.
📝 Abstract
The rapid growth of biomedical literature poses challenges for manual knowledge curation and synthesis. Biomedical Natural Language Processing (BioNLP) automates the process. While Large Language Models (LLMs) have shown promise in general domains, their effectiveness in BioNLP tasks remains unclear due to limited benchmarks and practical guidelines. We perform a systematic evaluation of four LLMs—GPT and LLaMA representatives—on 12 BioNLP benchmarks across six applications. We compare their zero-shot, few-shot, and fine-tuning performance with the traditional fine-tuning of BERT or BART models. We examine inconsistencies, missing information, hallucinations, and perform cost analysis. Here, we show that traditional fine-tuning outperforms zero- or few-shot LLMs in most tasks. However, closed-source LLMs like GPT-4 excel in reasoning-related tasks such as medical question answering. Open-source LLMs still require fine-tuning to close performance gaps. We find issues like missing information and hallucinations in LLM outputs. These results offer practical insights for applying LLMs in BioNLP.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' effectiveness in biomedical NLP tasks
Comparing traditional fine-tuning vs zero-shot/few-shot LLM performance
Identifying LLM limitations like hallucinations in biomedical outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic evaluation of four LLMs on BioNLP benchmarks
Comparison of zero-shot, few-shot, and fine-tuning performance
Analysis of inconsistencies, hallucinations, and cost efficiency
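The evaluation setup above contrasts zero-shot and few-shot prompting with supervised fine-tuning on tasks such as BC5CDR (chemical/disease NER). As an illustrative sketch only (not the authors' code; the prompt wording and helper names are assumptions), the two prompting regimes and the strict entity-level F1 typically used to score NER outputs can be outlined as:

```python
# Sketch of zero-shot vs few-shot prompt construction for a
# BC5CDR-style NER task, plus strict entity-level F1 scoring.
# Prompt template and function names are illustrative assumptions.

def build_prompt(sentence, examples=()):
    """Zero-shot when `examples` is empty; few-shot otherwise."""
    task = ("Extract all chemical and disease mentions from the sentence. "
            "Return them as a comma-separated list.")
    parts = [task]
    for ex_sent, ex_ents in examples:  # few-shot demonstrations
        parts.append(f"Sentence: {ex_sent}\nEntities: {', '.join(ex_ents)}")
    parts.append(f"Sentence: {sentence}\nEntities:")
    return "\n\n".join(parts)

def entity_f1(predicted, gold):
    """Strict (exact-match) entity-level F1 over sets of mentions."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)                      # true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this framing, the paper's comparison amounts to sending such prompts to each LLM versus training a BERT/BART tagger on the labeled data, then scoring both with the same entity-level metric.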
Qingyu Chen
Biomedical Informatics & Data Science, Yale University; NCBI-NLM, National Institutes of Health
Text mining · Machine learning · Data curation · BioNLP · Medical Imaging Analysis
Jingcheng Du
School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, USA
Yan Hu
School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, USA
V. Keloth
Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, USA
Xueqing Peng
Yale University
Kalpana Raja
Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, USA
Rui Zhang
Department of Surgery, School of Medicine, University of Minnesota, Minneapolis, USA
Zhiyong Lu
Senior Investigator, NLM; Adjunct Professor of CS, UIUC
BioNLP · Biomedical Informatics · Medical AI · Artificial Intelligence
Hua Xu
Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, USA