🤖 AI Summary
Current large language models (LLMs) lack systematic evaluation on high-precision, highly procedural scientific texts—particularly biological experimental protocols—hindering their deployment in life sciences where reproducibility and safety are paramount.
Method: We introduce BioProBench, the first large-scale, multi-task benchmark for biological protocol understanding and reasoning, comprising 556K high-quality structured samples derived from 27K real-world protocols. It covers five core tasks: question answering, step ordering, error correction, protocol generation, and causal reasoning. Data quality is ensured via human verification and rule-based validation.
Contribution/Results: BioProBench provides the first systematic definition and empirical assessment of LLMs’ deep procedural reasoning capabilities over programmatic biological text. Experiments across 12 state-of-the-art open- and closed-source models reveal critical bottlenecks—e.g., step ordering accuracy peaks at only 68.2%. Notably, certain open-source models match top proprietary models on specific tasks, highlighting promising avenues for domain-adapted LLM development.
📝 Abstract
Biological protocols are fundamental to reproducible and safe life science research. While LLMs excel on general tasks, their systematic evaluation on these highly specialized, accuracy-critical, and inherently procedural texts remains limited. In this work, we present BioProBench, the first large-scale, integrated multi-task benchmark for biological protocol understanding and reasoning. While a few prior benchmarks have touched upon specific aspects such as protocol QA, BioProBench provides a comprehensive suite of five core tasks: Protocol Question Answering, Step Ordering, Error Correction, Protocol Generation, and Protocol Reasoning, enabling a holistic evaluation of LLMs on procedural biological texts. Built upon 27K original protocols, it yields nearly 556K high-quality structured instances. We evaluate 12 mainstream open- and closed-source LLMs on BioProBench. Experimental results reveal that while top models perform well on surface-level understanding tasks, they struggle significantly with deep reasoning and structured generation tasks such as step ordering and protocol generation. Furthermore, model comparisons reveal diverse performance: certain open-source models approach closed-source levels on some tasks, yet bio-specific small models lag behind general LLMs, indicating limitations on complex procedural content. Overall, our findings underscore that procedural reasoning within biological protocols represents a significant challenge for current LLMs. BioProBench serves as a standardized framework to diagnose these specific limitations and guide the development of AI systems better equipped for safely automating complex scientific procedures. The code and data are available at: https://github.com/YuyangSunshine/bioprotocolbench and https://huggingface.co/datasets/GreatCaptainNemo/BioProBench.
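As a concrete illustration of how a task like Step Ordering might be scored, the sketch below computes exact-match accuracy over predicted step permutations. The sample format (a list of step indices per protocol) is a hypothetical simplification, not BioProBench's actual schema or official metric:

```python
# Minimal sketch of exact-match scoring for a Step Ordering-style task.
# NOTE: the per-sample format (lists of step indices) is a hypothetical
# illustration; consult the BioProBench repository for the real schema.

def step_ordering_accuracy(predictions, references):
    """Fraction of samples whose predicted step order exactly matches the gold order."""
    assert len(predictions) == len(references), "prediction/reference count mismatch"
    exact = sum(1 for pred, gold in zip(predictions, references) if pred == gold)
    return exact / len(predictions)

# Toy example: each item is one protocol's predicted ordering of step indices.
preds = [
    [0, 1, 2, 3],  # matches the gold order
    [1, 0, 2, 3],  # first two steps swapped
]
golds = [
    [0, 1, 2, 3],
    [0, 1, 2, 3],
]

print(step_ordering_accuracy(preds, golds))  # → 0.5
```

Exact match is a deliberately strict choice for procedural text, since executing even two steps out of order can invalidate a wet-lab protocol; partial-credit metrics (e.g., Kendall's tau over the permutation) are a softer alternative.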