Benchmarking large language models for biomedical natural language processing applications and recommendations

📅 2023-05-10
🏛️ Nature Communications
📈 Citations: 41
Influential: 1
🤖 AI Summary
The applicability of large language models (LLMs) in biomedical natural language processing (BioNLP) remains insufficiently systematized. Method: We conduct a comprehensive benchmarking study evaluating four LLMs from the GPT and LLaMA families across 12 BioNLP benchmarks (e.g., MedQA, BC5CDR) under zero-shot, few-shot, and supervised fine-tuning settings, contrasting them against traditional BERT/BART fine-tuning. We further integrate hallucination detection, information-omission analysis, and computational cost modeling. Contribution/Results: GPT-4 achieves superior performance on medical reasoning tasks; open-source LLMs, when fine-tuned, approach the performance of closed-source counterparts; yet traditional fine-tuning still outperforms zero- and few-shot LLMs on most tasks. Our analysis delineates the performance boundaries, critical failure modes (e.g., hallucination, knowledge gaps), and cost-accuracy trade-offs of LLMs in BioNLP, yielding actionable, clinically oriented deployment guidelines.
📝 Abstract
The rapid growth of biomedical literature poses challenges for manual knowledge curation and synthesis. Biomedical Natural Language Processing (BioNLP) automates the process. While Large Language Models (LLMs) have shown promise in general domains, their effectiveness in BioNLP tasks remains unclear due to limited benchmarks and practical guidelines. We perform a systematic evaluation of four LLMs—GPT and LLaMA representatives—on 12 BioNLP benchmarks across six applications. We compare their zero-shot, few-shot, and fine-tuning performance with the traditional fine-tuning of BERT or BART models. We examine inconsistencies, missing information, hallucinations, and perform cost analysis. Here, we show that traditional fine-tuning outperforms zero- or few-shot LLMs in most tasks. However, closed-source LLMs like GPT-4 excel in reasoning-related tasks such as medical question answering. Open-source LLMs still require fine-tuning to close performance gaps. We find issues like missing information and hallucinations in LLM outputs. These results offer practical insights for applying LLMs in BioNLP.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' effectiveness in biomedical NLP tasks
Comparing traditional fine-tuning vs zero-shot/few-shot LLM performance
Identifying LLM limitations like hallucinations in biomedical outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic evaluation of four LLMs on BioNLP benchmarks
Comparison of zero-shot, few-shot, and fine-tuning performance
Analysis of inconsistencies, hallucinations, and cost efficiency
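The evaluation setup above contrasts zero-shot and few-shot prompting with supervised fine-tuning on tasks such as BC5CDR (chemical/disease NER). As an illustrative sketch only (not the authors' code; the prompt wording and helper names are assumptions), the two prompting regimes and the strict entity-level F1 typically used to score NER outputs can be outlined as:

```python
# Sketch of zero-shot vs few-shot prompt construction for a
# BC5CDR-style NER task, plus strict entity-level F1 scoring.
# Prompt template and function names are illustrative assumptions.

def build_prompt(sentence, examples=()):
    """Zero-shot when `examples` is empty; few-shot otherwise."""
    task = ("Extract all chemical and disease mentions from the sentence. "
            "Return them as a comma-separated list.")
    parts = [task]
    for ex_sent, ex_ents in examples:  # few-shot demonstrations
        parts.append(f"Sentence: {ex_sent}\nEntities: {', '.join(ex_ents)}")
    parts.append(f"Sentence: {sentence}\nEntities:")
    return "\n\n".join(parts)

def entity_f1(predicted, gold):
    """Strict (exact-match) entity-level F1 over sets of mentions."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)                      # true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this framing, the paper's comparison amounts to sending such prompts to each LLM versus training a BERT/BART tagger on the labeled data, then scoring both with the same entity-level metric.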
Qingyu Chen
Biomedical Informatics & Data Science, Yale University; NCBI-NLM, National Institutes of Health
Text mining · Machine learning · Data curation · BioNLP · Medical Imaging Analysis
Jingcheng Du
School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, USA
Yan Hu
School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, USA
V. Keloth
Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, USA
Xueqing Peng
Yale University
Kalpana Raja
Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, USA
Rui Zhang
Department of Surgery, School of Medicine, University of Minnesota, Minneapolis, USA
Zhiyong Lu
Senior Investigator, NLM; Adjunct Professor of CS, UIUC
BioNLP · Biomedical Informatics · Medical AI · Artificial Intelligence
Hua Xu
Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, USA