🤖 AI Summary
Despite the growing adoption of large language models (LLMs) in biomedicine, systematic evaluation of their generalization across core biomedical NLP tasks remains limited. Method: This study presents the first comprehensive assessment of DeepSeek models (the Distilled-DeepSeek-R1 series and DeepSeek-LLMs) on four fundamental biomedical NLP tasks: named entity recognition (NER), relation extraction (RE), event extraction (EE), and text classification, using 12 standard benchmarks. We employ a zero-shot and few-shot multi-task evaluation framework, report standardized F1, precision, and recall scores, and compare against strong baselines including Llama3-8B and Qwen2.5-7B. Results: DeepSeek achieves competitive or state-of-the-art performance on NER and text classification; however, it exhibits a pronounced precision–recall trade-off on RE and EE, yielding average F1 scores 3.2–5.7 percentage points lower than those of the top performers. We release a reproducible biomedical LLM evaluation benchmark and propose task-specific model selection guidelines, providing an empirical foundation for domain adaptation and distillation optimization.
📝 Abstract
The advancement of Large Language Models (LLMs) has significantly impacted biomedical Natural Language Processing (NLP), enhancing tasks such as named entity recognition, relation extraction, event extraction, and text classification. In this context, the DeepSeek series of models has shown promising potential on general NLP tasks, yet its capabilities in the biomedical domain remain underexplored. This study evaluates multiple DeepSeek models (the Distilled-DeepSeek-R1 series and DeepSeek-LLMs) across four key biomedical NLP tasks using 12 datasets, benchmarking them against state-of-the-art alternatives (Llama3-8B, Qwen2.5-7B, Mistral-7B, Phi-4-14B, and Gemma-2-9B). Our results reveal that while DeepSeek models perform competitively in named entity recognition and text classification, challenges persist in event and relation extraction due to precision–recall trade-offs. We provide task-specific model recommendations and highlight future research directions. This evaluation underscores the strengths and limitations of DeepSeek models in biomedical NLP, guiding their future deployment and optimization.
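The precision–recall trade-off noted for relation and event extraction follows directly from how F1 is defined: as the harmonic mean of precision and recall, F1 is dragged toward the lower of the two, so a model that predicts conservatively (high precision, low recall) still scores poorly. A minimal sketch of these standard metrics, with illustrative counts that are not taken from the paper:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from extraction-level counts.

    tp: predictions matching a gold relation/event/entity
    fp: predictions with no gold match
    fn: gold items the model missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean, so it is dominated by the weaker of the two.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Hypothetical conservative extractor: few false positives, many misses.
p, r, f1 = precision_recall_f1(tp=40, fp=5, fn=60)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Here precision is about 0.89 but recall is only 0.40, and the harmonic mean pulls F1 down to roughly 0.55, the kind of gap the RE and EE results describe.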