An Evaluation of DeepSeek Models in Biomedical Natural Language Processing

📅 2025-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Despite the growing adoption of large language models (LLMs) in biomedicine, systematic evaluation of their generalization across core biomedical NLP tasks remains limited. Method: This study presents the first comprehensive assessment of DeepSeek models (the Distilled-DeepSeek-R1 series and DeepSeek-LLMs) across four fundamental biomedical NLP tasks: named entity recognition (NER), relation extraction (RE), event extraction (EE), and text classification, using 12 standard benchmarks. We employ a zero-shot and few-shot multi-task evaluation framework, reporting standardized F1, precision, and recall scores, and compare against strong baselines including Llama3-8B and Qwen2.5-7B. Results: DeepSeek achieves competitive or state-of-the-art performance on NER and text classification; however, it exhibits a pronounced precision–recall trade-off in RE and EE, yielding average F1 scores 3.2–5.7 percentage points lower than the top performers. We release a reproducible biomedical LLM evaluation benchmark and propose task-specific model selection guidelines, offering empirical foundations for domain adaptation and distillation optimization.
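The precision–recall trade-off reported for relation and event extraction follows directly from how F1 is computed: as the harmonic mean of precision and recall, F1 penalizes an imbalance between the two more heavily than an arithmetic average would. A minimal sketch (the numbers below are illustrative, not taken from the paper's results):

```python
def f1(precision: float, recall: float) -> float:
    """F1 score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A balanced extractor vs. a high-precision / low-recall one.
# Both have the same arithmetic mean of precision and recall (0.80 vs 0.775),
# but the harmonic mean punishes the skewed system's gap.
balanced = f1(0.80, 0.80)   # 0.800
skewed = f1(0.95, 0.60)     # ~0.735
print(f"balanced F1: {balanced:.3f}, skewed F1: {skewed:.3f}")
```

This is why a model that over- or under-predicts entity relations can trail a balanced competitor by several F1 points even when one of its component metrics is higher.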

📝 Abstract
The advancement of Large Language Models (LLMs) has significantly impacted biomedical Natural Language Processing (NLP), enhancing tasks such as named entity recognition, relation extraction, event extraction, and text classification. In this context, the DeepSeek series of models has shown promising potential in general NLP tasks, yet its capabilities in the biomedical domain remain underexplored. This study evaluates multiple DeepSeek models (the Distilled-DeepSeek-R1 series and DeepSeek-LLMs) across four key biomedical NLP tasks using 12 datasets, benchmarking them against state-of-the-art alternatives (Llama3-8B, Qwen2.5-7B, Mistral-7B, Phi-4-14B, Gemma-2-9B). Our results reveal that while DeepSeek models perform competitively in named entity recognition and text classification, challenges persist in event and relation extraction due to precision–recall trade-offs. We provide task-specific model recommendations and highlight future research directions. This evaluation underscores the strengths and limitations of DeepSeek models in biomedical NLP, guiding their future deployment and optimization.
Problem

Research questions and friction points this paper is trying to address.

Evaluate DeepSeek models in biomedical NLP tasks.
Assess performance in named entity recognition and text classification.
Identify challenges in event and relation extraction tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates DeepSeek models in biomedical NLP tasks
Benchmarks against state-of-the-art LLMs like Llama3-8B
Identifies precision-recall trade-offs in event extraction
Zaifu Zhan
PhD at University of Minnesota, MS at Tsinghua University
Natural Language Processing · Machine Learning · AI for Biomedicine · Large Language Models
Shuang Zhou
University of Minnesota Twin Cities, Minneapolis, MN, USA
Huixue Zhou
PhD candidate at University of Minnesota
Natural Language Processing · Health Informatics
Jiawen Deng
University of Electronic Science and Technology of China
NLP · AI Safety · Affective Computing
Yu Hou
University of Minnesota Twin Cities, Minneapolis, MN, USA
Jeremy Yeung
University of Minnesota Twin Cities, Minneapolis, MN, USA
Rui Zhang
University of Minnesota Twin Cities, Minneapolis, MN, USA