CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field

📅 2025-11-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks inadequately assess large language models' (LLMs) critical reasoning capabilities, particularly identifying methodological limitations and performing statistical inference, in biomedical contexts. Method: We introduce CareMedEval, a benchmark tailored to professional medical education, constructed from authentic exam questions taken by French medical students and grounded in 37 peer-reviewed biomedical papers (534 items). It systematically evaluates scientific literature comprehension and reliability of reasoning under domain-specific constraints. Using both open-source and commercial LLMs, we conduct evaluation under various context conditions, combining exact-match accuracy with granular analysis of intermediate reasoning chains. Results: State-of-the-art models fail to exceed an Exact Match Rate of 0.5; generating intermediate reasoning tokens considerably improves results, yet fundamental bottlenecks persist on questions about study limitations and statistical analysis. This work establishes a rigorous, domain-grounded evaluation framework for critical reasoning in biomedicine, enabling evidence-based assessment of LLM trustworthiness in high-stakes clinical and educational applications.

📝 Abstract
Critical appraisal of scientific literature is an essential skill in the biomedical field. While large language models (LLMs) can offer promising support in this task, their reliability remains limited, particularly for critical reasoning in specialized domains. We introduce CareMedEval, an original dataset designed to evaluate LLMs on biomedical critical appraisal and reasoning tasks. Derived from authentic exams taken by French medical students, the dataset contains 534 questions based on 37 scientific articles. Unlike existing benchmarks, CareMedEval explicitly evaluates critical reading and reasoning grounded in scientific papers. Benchmarking state-of-the-art generalist and biomedical-specialized LLMs under various context conditions reveals the difficulty of the task: open and commercial models fail to exceed an Exact Match Rate of 0.5 even though generating intermediate reasoning tokens considerably improves the results. Yet, models remain challenged especially on questions about study limitations and statistical analysis. CareMedEval provides a challenging benchmark for grounded reasoning, exposing current LLM limitations and paving the way for future development of automated support for critical appraisal.
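As a concrete illustration of the Exact Match Rate the abstract reports, below is a minimal sketch of how such a score could be computed for multiple-answer exam items. The item format, option letters, and answer parsing are assumptions for demonstration, not the authors' released evaluation code.

```python
# Illustrative sketch of an Exact Match Rate (EMR) computation for
# multiple-answer MCQ items (e.g. options A-E, several may be correct).
# The item format and answer parsing below are assumptions, not the
# CareMedEval authors' released evaluation code.

import re

def parse_options(text: str) -> frozenset[str]:
    """Extract the set of selected option letters (A-E) from an answer string."""
    return frozenset(re.findall(r"\b([A-E])\b", text.upper()))

def exact_match_rate(predictions: list[str], references: list[str]) -> float:
    """Fraction of items where the predicted option set equals the gold set."""
    assert len(predictions) == len(references)
    hits = sum(
        parse_options(pred) == parse_options(gold)
        for pred, gold in zip(predictions, references)
    )
    return hits / len(references)

# Example: 1 of 2 items matches exactly, so EMR = 0.5.
print(exact_match_rate(["A, C", "B"], ["A C", "B D"]))  # 0.5
```

Under this strict criterion a model earns credit only when its full set of selected options matches the gold set, which helps explain why scores stay at or below 0.5 even for strong models.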
Problem

Research questions and friction points this paper is trying to address.

Evaluating the critical reasoning abilities of LLMs on biomedical literature
Assessing LLM performance on questions about study limitations and statistical analysis
Providing a challenging benchmark for grounded reasoning in specialized medical domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dataset of 534 questions derived from authentic French medical student exams
Explicitly evaluates critical reading and reasoning grounded in scientific papers
Benchmarks LLMs with and without intermediate reasoning tokens under various context conditions (see the sketch below)
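The context conditions mentioned above can be pictured with a small prompt-building sketch. The condition names and prompt wording here are illustrative assumptions, not the paper's actual prompts.

```python
# Hypothetical sketch of the context conditions the abstract describes:
# answering the question alone, grounding it in the source article, and
# eliciting intermediate reasoning tokens. Prompt wording and parameter
# names are illustrative assumptions, not the paper's actual prompts.

def build_prompt(question: str, options: list[str],
                 article: str | None = None,
                 chain_of_thought: bool = False) -> str:
    parts = []
    if article is not None:  # grounded condition: include the paper's text
        parts.append(f"Article:\n{article}\n")
    parts.append(f"Question: {question}")
    parts.extend(f"{letter}. {opt}" for letter, opt in zip("ABCDE", options))
    if chain_of_thought:  # reasoning condition: ask for intermediate steps
        parts.append("Reason step by step, then state the correct option(s).")
    else:
        parts.append("State the correct option(s).")
    return "\n".join(parts)
```

Varying `article` and `chain_of_thought` yields the question-only, article-grounded, and reasoning-eliciting conditions that the benchmark compares.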
Doria Bonzi
University of Lorraine, LORIA, France

A. Guiggi
University Grenoble-Alpes, France

Frédéric Béchet
Aix-Marseille University, LIS, France

Carlos Ramisch
Aix-Marseille University
Computational Linguistics

Benoit Favre
Professor (CNU section 27), LIS UMR 7020, Aix-Marseille University
Natural Language Processing · Spoken Language Understanding · Parsing · Machine Learning