BioPulse-QA: A Dynamic Biomedical Question-Answering Benchmark for Evaluating Factuality, Robustness, and Bias in Large Language Models

📅 2026-01-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses critical limitations in existing biomedical question-answering benchmarks, which are often static, outdated, and prone to data leakage, and which lack systematic evaluation of model factuality, robustness, and bias. To overcome these issues, the authors introduce a dynamic evaluation benchmark constructed from up-to-date drug labels, clinical trial protocols, and clinical guidelines, comprising 2,280 expert-validated question-answer pairs along with perturbed variants (e.g., paraphrases and spelling errors). This benchmark enables the first time-sensitive, multi-source, and expert-verified assessment of biomedical QA systems. Experiments reveal that GPT-o1 achieves a relaxed F1 score of 0.92 on drug label questions, while clinical trial queries remain highly challenging (extractive F1: 0.36). Models are more sensitive to semantic perturbations (paraphrases) than to spelling errors, and demographic bias tests show negligible performance differences. The proposed benchmark establishes a new standard for evaluating the reliability of large language models in real-world clinical settings.

📝 Abstract
Objective: Large language models (LLMs) are increasingly applied in biomedical settings, and existing benchmark datasets have played an important role in supporting model development and evaluation. However, these benchmarks often have limitations. Many rely on static or outdated datasets that fail to capture the dynamic, context-rich, and high-stakes nature of biomedical knowledge. They also carry an increasing risk of data leakage due to overlap with model pretraining corpora, and often overlook critical dimensions such as robustness to linguistic variation and potential demographic biases. Materials and Methods: To address these gaps, we introduce BioPulse-QA, a benchmark that evaluates LLMs on answering questions from newly published biomedical documents, including drug labels, trial protocols, and clinical guidelines. BioPulse-QA includes 2,280 expert-verified question-answering (QA) pairs and perturbed variants, covering both extractive and abstractive formats. We evaluate four LLMs (GPT-4o, GPT-o1, Gemini-2.0-Flash, and LLaMA-3.1 8B Instruct), all released prior to the publication dates of the benchmark documents. Results: GPT-o1 achieves the highest relaxed F1 score (0.92) on drug labels, followed by Gemini-2.0-Flash (0.90). Clinical trials are the most challenging source, with extractive F1 scores as low as 0.36. Discussion and Conclusion: Performance differences are larger for paraphrasing than for typographical errors, while bias testing shows negligible differences. BioPulse-QA provides a scalable and clinically relevant framework for evaluating biomedical LLMs.
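The results above are reported as extractive and "relaxed" F1 scores. For readers unfamiliar with the metric family, the sketch below shows the standard token-overlap F1 used in extractive QA (SQuAD-style); the paper's exact relaxed-matching rules are not specified here, so the lowercased whitespace tokenization is an assumption, not the authors' implementation.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a gold answer.

    Sketch only: assumes lowercased whitespace tokenization; published
    evaluation scripts typically also strip punctuation and articles.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Edge case: if either side is empty, F1 is 1 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A fully correct extraction scores 1.0; a prediction sharing only some tokens with the gold span, e.g. `token_f1("take with food", "take food")`, scores 0.8 (precision 2/3, recall 1). A "relaxed" variant would loosen the matching criterion further, for instance by accepting answers that contain the gold span.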
Problem

Research questions and friction points this paper is trying to address.

biomedical question answering
large language models
benchmark limitations
factuality
bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic benchmark
biomedical QA
factuality evaluation
robustness to linguistic variation
bias assessment
Kriti Bhattarai
Department of Biomedical Informatics and Data Science, Yale University, New Haven, CT, USA
V. Keloth
Department of Biomedical Informatics and Data Science, Yale University, New Haven, CT, USA
Donald Wright
Department of Biomedical Informatics and Data Science, Yale University, New Haven, CT, USA
Andrew Loza
Department of Biomedical Informatics and Data Science, Yale University, New Haven, CT, USA
Yang Ren
Department of Biomedical Informatics and Data Science, Yale University, New Haven, CT, USA
Hua Xu
Robert T. McCluskey Professor, Section of Biomedical Informatics and Data Science, Yale University
natural language processing
text mining