🤖 AI Summary
The applicability of large language models (LLMs) in biomedical natural language processing (BioNLP) remains insufficiently systematized. Method: We conduct the first comprehensive benchmarking study evaluating four LLMs from the GPT and LLaMA families across 12 BioNLP benchmarks (e.g., MedQA, BC5CDR) under zero-shot, few-shot, and supervised fine-tuning settings, contrasting them against traditional BERT/BART fine-tuning. We further integrate hallucination detection, information-omission analysis, and computational cost modeling. Contribution/Results: GPT-4 achieves superior performance on medical reasoning tasks; open-source LLMs, when fine-tuned, approach the performance of closed-source counterparts; yet traditional fine-tuning still outperforms zero- and few-shot LLMs on most tasks. Our analysis delineates the performance boundaries, critical failure modes (e.g., hallucination and missing information), and cost-accuracy trade-offs of LLMs in BioNLP, yielding actionable, clinically oriented deployment guidelines.
📝 Abstract
The rapid growth of biomedical literature poses challenges for manual knowledge curation and synthesis. Biomedical Natural Language Processing (BioNLP) automates this process. While Large Language Models (LLMs) have shown promise in general domains, their effectiveness on BioNLP tasks remains unclear due to limited benchmarks and practical guidelines. We perform a systematic evaluation of four LLMs, representatives of the GPT and LLaMA families, on 12 BioNLP benchmarks across six applications. We compare their zero-shot, few-shot, and fine-tuning performance with the traditional fine-tuning of BERT or BART models. We examine inconsistencies, missing information, and hallucinations, and perform a cost analysis. Here, we show that traditional fine-tuning outperforms zero- or few-shot LLMs in most tasks. However, closed-source LLMs like GPT-4 excel in reasoning-related tasks such as medical question answering. Open-source LLMs still require fine-tuning to close performance gaps. We also find issues such as missing information and hallucinations in LLM outputs. These results offer practical insights for applying LLMs in BioNLP.
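To make the zero-shot vs. few-shot distinction concrete, the sketch below shows one plausible way to assemble prompts for a BioNLP extraction task such as chemical/disease NER on BC5CDR. The instruction wording, demonstration pair, and output format are illustrative assumptions for this sketch, not the evaluation prompts actually used in the study.

```python
# Illustrative prompt construction for zero-shot vs. few-shot evaluation of an
# LLM on a BioNLP NER task (e.g., BC5CDR-style chemical/disease extraction).
# All strings below are hypothetical examples, not the study's real prompts.

TASK_INSTRUCTION = (
    "Extract all chemical and disease mentions from the sentence. "
    "Return them as a comma-separated list of 'entity (type)' pairs."
)

# Hypothetical labeled demonstration used only in the few-shot setting.
DEMONSTRATIONS = [
    ("Naloxone reverses the antihypertensive effect of clonidine.",
     "Naloxone (chemical), antihypertensive (disease), clonidine (chemical)"),
]

def build_prompt(sentence: str, n_shots: int = 0) -> str:
    """Assemble a zero-shot (n_shots=0) or few-shot prompt for one input."""
    parts = [TASK_INSTRUCTION, ""]
    # Few-shot: prepend n_shots labeled demonstrations before the query.
    for demo_input, demo_output in DEMONSTRATIONS[:n_shots]:
        parts += [f"Sentence: {demo_input}", f"Entities: {demo_output}", ""]
    # The query itself, left open for the model to complete.
    parts += [f"Sentence: {sentence}", "Entities:"]
    return "\n".join(parts)

if __name__ == "__main__":
    query = "Lidocaine-induced cardiac asystole was observed."
    print(build_prompt(query, n_shots=0))  # instruction + query only
    print("---")
    print(build_prompt(query, n_shots=1))  # adds one labeled demonstration
```

The supervised baselines in the comparison (fine-tuned BERT/BART) skip prompting entirely and instead update model weights on the task's training split, which is why the paper reports them as a separate setting.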