Benchmarking Large Language Models on Multiple Tasks in Bioinformatics NLP with Prompting

📅 2025-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing bioinformatics benchmarks inadequately assess the real-world, cross-domain capabilities of large language models (LLMs). Method: We introduce Bio-benchmark, the first prompt-driven, multi-domain evaluation suite covering 30 tasks across proteins, RNA, drugs, electronic health records, and traditional Chinese medicine, enabling zero-shot and few-shot chain-of-thought evaluation without fine-tuning. We also present BioFinder, a high-accuracy answer-extraction tool that improves extraction accuracy by around 30%, together with domain-specific, knowledge-aware prompt engineering strategies tailored for biological reasoning. Results: Comprehensive experiments systematically characterize the capability boundaries of state-of-the-art models, including GPT-4o and Llama-3.1-70B, revealing critical weaknesses in biological reasoning. Bio-benchmark establishes an empirically grounded, reproducible assessment paradigm to guide the development of biology-specialized LLMs.

📝 Abstract
Large language models (LLMs) have become important tools for solving biological problems, offering improvements in accuracy and adaptability over conventional methods. Several benchmarks have been proposed to evaluate the performance of these LLMs. However, current benchmarks can hardly evaluate the performance of these models across diverse tasks effectively. In this paper, we introduce a comprehensive prompting-based benchmarking framework, termed Bio-benchmark, which includes 30 key bioinformatics tasks covering areas such as proteins, RNA, drugs, electronic health records, and traditional Chinese medicine. Using this benchmark, we evaluate six mainstream LLMs, including GPT-4o and Llama-3.1-70B, in zero-shot and few-shot Chain-of-Thought (CoT) settings without fine-tuning to reveal their intrinsic capabilities. To improve the efficiency of our evaluations, we present BioFinder, a new tool for extracting answers from LLM responses, which increases extraction accuracy by around 30% compared to existing methods. Our benchmark results show which biological tasks are suitable for current LLMs and identify specific areas requiring enhancement. Furthermore, we propose targeted prompt engineering strategies for optimizing LLM performance in these contexts. Based on these findings, we provide recommendations for the development of more robust LLMs tailored for various biological applications. This work offers a comprehensive evaluation framework and robust tools to support the application of LLMs in bioinformatics.
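The abstract describes evaluating models in zero-shot and few-shot CoT settings without fine-tuning, with answers pulled out of free-form responses by an extraction tool. As a minimal sketch of how such a prompting evaluation loop could be wired up (the paper's actual harness is not reproduced here; `query_model`, `extract_answer`, and the prompt wording below are hypothetical placeholders):

```python
# Hypothetical sketch of a zero-shot vs. few-shot CoT evaluation loop.
# None of these names come from Bio-benchmark itself; they only illustrate
# the general shape of a prompting-based evaluation without fine-tuning.
from typing import Callable, Dict, List

def build_prompt(question: str, exemplars: List[Dict[str, str]]) -> str:
    """Zero-shot if `exemplars` is empty; few-shot CoT otherwise."""
    parts = []
    for ex in exemplars:
        # Each exemplar carries a worked chain of thought plus its answer.
        parts.append(
            f"Question: {ex['question']}\n"
            f"Reasoning: {ex['rationale']}\n"
            f"Answer: {ex['answer']}"
        )
    parts.append(f"Question: {question}\nLet's think step by step.")
    return "\n\n".join(parts)

def evaluate(tasks: List[Dict[str, str]],
             query_model: Callable[[str], str],
             exemplars: List[Dict[str, str]],
             extract_answer: Callable[[str], str]) -> float:
    """Accuracy of one model over one task set under a fixed prompt setting."""
    correct = 0
    for task in tasks:
        response = query_model(build_prompt(task["question"], exemplars))
        # Free-form CoT output must be reduced to a comparable answer string;
        # this extraction step is the role BioFinder plays in the paper.
        if extract_answer(response).strip().lower() == task["answer"].strip().lower():
            correct += 1
    return correct / len(tasks)
```

Running the same `evaluate` call once with an empty exemplar list and once with a handful of CoT exemplars yields the zero-shot and few-shot numbers side by side, matching how the abstract frames the comparison.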
Problem

Research questions and friction points this paper is trying to address.

How to evaluate LLMs effectively across diverse bioinformatics tasks.
How to build a benchmark, Bio-benchmark, that spans 30 key bioinformatics tasks.
How to optimize LLM performance on these tasks through prompting strategies.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed Bio-benchmark, a suite of 30 bioinformatics tasks
Introduced BioFinder for efficient answer extraction (see the sketch after this list)
Proposed prompt engineering strategies for optimizing LLM performance
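BioFinder's internals are not described on this page, so the following is only a sketch of the general rule-based style such an answer extractor might take; the regex patterns and the fallback heuristic are illustrative assumptions, not BioFinder's actual rules:

```python
import re
from typing import List

# Illustrative extraction rules in the general style of an answer-extraction
# tool like BioFinder. These patterns are assumptions, not the paper's rules.
_ANSWER_PATTERNS = [
    re.compile(r"(?:final answer|the answer is)[:\s]*([^\n.]+)", re.IGNORECASE),
    re.compile(r"\banswer[:\s]+([^\n.]+)", re.IGNORECASE),
]

def extract_answer(response: str) -> str:
    """Pull a short answer span out of a free-form LLM response.

    Tries explicit answer markers first, then falls back to the last
    non-empty line, a common heuristic for CoT outputs where the
    conclusion follows the reasoning. Returns "" if nothing is found.
    """
    for pattern in _ANSWER_PATTERNS:
        match = pattern.search(response)
        if match:
            return match.group(1).strip()
    lines: List[str] = [ln.strip() for ln in response.splitlines() if ln.strip()]
    return lines[-1] if lines else ""
```

The extracted string would then be compared against the gold label by exact match, plugging into the evaluation loop sketched after the abstract above.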
👥 Authors
Jiyue Jiang
The Chinese University of Hong Kong
Pengan Chen
The University of Hong Kong
Jiuming Wang
The Chinese University of Hong Kong
Dongchen He
The Chinese University of Hong Kong
Research interests: AI4Sci, Bioinformatics
Ziqin Wei
The Chinese University of Hong Kong
Liang Hong
The Chinese University of Hong Kong
Licheng Zong
The Chinese University of Hong Kong
Research interests: AI for Science, AI in Healthcare, Large Language Models, Microbiology
Sheng Wang
The University of Hong Kong
Qinze Yu
The Chinese University of Hong Kong
Zixian Ma
University of Washington
Research interests: Multi-modal models and agents, human-agent interaction and collaboration
Yanyu Chen
The Chinese University of Hong Kong
Yimin Fan
The Chinese University of Hong Kong
Research interests: Single-cell genomics, Foundation Models
Xiangyu Shi
Head of Algorithm Department, Metaradio
Research interests: Applications of Deep Learning, Computational Linguistics, Computational Biology, Wireless Communication
Jiawei Sun
Shanghai AI Lab
Chuan Wu
Professor of Computer Science, The University of Hong Kong
Research interests: Cloud computing, distributed machine learning algorithms and systems
Yu Li
The Chinese University of Hong Kong