Do LLMs Surpass Encoders for Biomedical NER?

📅 2025-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Balancing performance and efficiency in biomedical named entity recognition (NER) remains challenging, particularly when comparing large language models (LLMs) against specialized Transformer encoders. Method: This study systematically evaluates LLMs (e.g., Mistral, Llama-8B) and encoder-based models (e.g., BERT, BioMedBERT, DeBERTa-v3) across five biomedical NER datasets. Crucially, all datasets are uniformly reformatted into BIO tagging to preserve positional information, and annotation consistency is rigorously controlled—enabling the first direct assessment of LLMs’ ability to model structured outputs. Contribution/Results: On four datasets, Mistral and Llama-8B achieve 2–8% higher F1 scores than the best encoder-based models, with pronounced gains for long entities (≥3 tokens); performance remains comparable on the fifth. However, this improvement incurs 10–100× higher inference latency and significantly increased hardware costs. The findings demonstrate that LLMs offer distinct advantages in biomedical NER—especially for complex, multi-token entities—but their adoption necessitates careful trade-offs between accuracy gains and computational overhead.

📝 Abstract
Recognizing spans of biomedical concepts and their types (e.g., drug or gene) in free text, often called biomedical named entity recognition (NER), is a basic component of information extraction (IE) pipelines. Without a strong NER component, other applications, such as knowledge discovery and information retrieval, are not practical. The state of the art in NER has shifted from traditional ML models to deep neural networks, with transformer-based encoder models (e.g., BERT) emerging as the current standard. However, decoder models (also called large language models or LLMs) are gaining traction in IE, yet LLM-driven NER often ignores positional information due to the generative nature of decoder models. Furthermore, LLMs are computationally very expensive, both in inference time and hardware needs. Hence, it is worth exploring whether they actually excel at biomedical NER and assessing the associated trade-offs (performance vs. efficiency). This is exactly what we do in this effort, employing the same BIO entity tagging scheme (which retains positional information) across five different datasets with varying proportions of longer entities. Our results show that the chosen LLMs (Mistral and Llama, in the 8B range) often outperform the best encoder models (BERT-(un)cased, BiomedBERT, and DeBERTav3, in the 300M range) by 2-8% in F-scores, except on one dataset where they equal encoder performance. The gain is more prominent among longer entities of length ≥ 3 tokens. However, LLMs are one to two orders of magnitude more expensive at inference time and may need cost-prohibitive hardware. Thus, when performance differences are small or real-time user feedback is needed, encoder models may still be more suitable than LLMs.
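The BIO scheme mentioned above marks each token as Beginning, Inside, or Outside an entity, which is how positional information is retained. A minimal sketch of BIO labeling (the sentence, span, and entity type below are illustrative examples, not drawn from the paper's datasets):

```python
# Minimal sketch of the BIO tagging scheme used to retain positional
# information: B- marks the first token of an entity, I- marks
# continuation tokens, and O marks tokens outside any entity.
def bio_tags(tokens, entity_spans):
    """entity_spans: list of (start, end, type) token ranges, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entity_spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

tokens = ["Administer", "tumor", "necrosis", "factor", "inhibitors", "daily"]
# "tumor necrosis factor inhibitors" is a long (4-token) DRUG entity,
# the kind the paper reports LLMs handling better.
print(bio_tags(tokens, [(1, 5, "DRUG")]))
# ['O', 'B-DRUG', 'I-DRUG', 'I-DRUG', 'I-DRUG', 'O']
```

Multi-token entities like this one fall into the ≥ 3 token bucket where the paper observes the most pronounced LLM gains.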
Problem

Research questions and friction points this paper is trying to address.

Assessing if LLMs outperform encoders in biomedical NER tasks.
Evaluating performance vs efficiency trade-offs of LLMs in NER.
Comparing positional information retention in LLM-driven biomedical NER.
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs outperform encoders in biomedical NER
LLMs retain positional information via BIO tagging
LLMs are computationally expensive but more accurate
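The F1 gains reported above are at the entity level: a predicted entity counts as correct only if its span and type both match the gold annotation exactly. A sketch of that evaluation, assuming standard CoNLL-style exact-match scoring (the paper does not spell out its scorer, so this is an illustration, not the authors' code):

```python
def decode_entities(tags):
    """Extract (start, end, type) spans from a BIO tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if tag == "O" or tag.startswith("B-"):
            if start is not None:
                spans.append((start, i, etype))
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
        elif tag.startswith("I-") and etype != tag[2:]:
            # I- with a mismatched type: close any open span, start a new one
            if start is not None:
                spans.append((start, i, etype))
            start, etype = i, tag[2:]
    return spans

def entity_f1(gold_tags, pred_tags):
    """Exact-match entity-level F1 over one tagged sequence."""
    gold = set(decode_entities(gold_tags))
    pred = set(decode_entities(pred_tags))
    tp = len(gold & pred)
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

For example, a prediction that recovers one of two gold entities scores precision 1.0, recall 0.5, and F1 ≈ 0.67; partial span overlaps earn no credit under exact matching.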
Motasem S Obeidat
Department of Computer Science, University of Kentucky, Lexington, KY USA
Md Sultan Al Nahian
Division of Biomedical Informatics, University of Kentucky, Lexington, KY USA
Ramakanth Kavuluru
Professor, University of Kentucky
natural language processing, biomedical informatics, machine learning, data science