When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering

📅 2026-05-20
📈 Citations: 0
Influential: 0

🤖 AI Summary
Current medical large language models rely predominantly on common knowledge from clinical guidelines and struggle with the many rare cases in real-world practice that fall outside guideline coverage. To address this gap, the authors introduce OGCaReBench, which they present as the first open-domain, off-guideline, case-based benchmark for evaluating free-text question answering on rare clinical scenarios. Built on expert-validated case reports, the benchmark assesses models' ability to perform open-ended, evidence-grounded reasoning. The work combines case extraction, expert validation, evaluation of large language models, and retrieval-augmented generation over external literature. In the reported experiments, even the strongest baseline model (GPT-5.2) correctly answers only 56% of questions, while retrieval augmentation raises accuracy to as much as 82%, underscoring the role of evidence retrieval in complex medical question answering.
📝 Abstract
Across medical specialties, clinical practice is anchored in evidence-based guidelines that codify the best-studied diagnostic and treatment pathways. These pathways routinely fall short for the long tail of real-world care not covered by guidelines. Most medical large language models (LLMs), however, are trained to encode common, guideline-focused medical knowledge in their parameters. Current evaluations test models primarily on recalling and reasoning with this memorized content, often in multiple-choice settings. Given the fundamental importance of evidence-based reasoning in medicine, it is neither feasible nor reliable to depend on memorization in practice. To address this gap, we introduce OGCaReBench, a free-form, retrieval-focused benchmark for evaluating LLMs on clinical questions that require going beyond typical guidelines. Extracted from published medical case reports and validated by medical experts, OGCaReBench contains long-form clinical questions requiring free-text answers, providing a systematic framework for assessing open-ended medical reasoning in rare, case-based scenarios. Our experiments reveal that even the best-performing baseline (GPT-5.2) correctly answers only 56% of our benchmark, with specialized models reaching only 42%. Augmenting models with retrieved medical articles improves performance to up to 82% (using GPT-5.2), highlighting the importance of evidence grounding for real-world medical reasoning tasks. This work thus establishes a foundation for benchmarking and advancing both general-purpose and medical LLMs to produce reliable answers in challenging clinical contexts.
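The comparison the abstract describes (the same questions answered with and without retrieved articles) can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the authors' pipeline: `retrieve`, `build_prompt`, and `evaluate` are hypothetical names, the retriever is a toy bag-of-words cosine ranker, and the model and grader are pluggable callables, since the abstract does not say how free-text answers are scored.

```python
import math
import re
from collections import Counter
from typing import Callable, Optional, Sequence, Tuple


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, corpus: Sequence[str], k: int = 2) -> list:
    """Toy retriever (assumption): rank articles by lexical similarity."""
    q = Counter(tokenize(question))
    ranked = sorted(
        corpus, key=lambda doc: cosine(q, Counter(tokenize(doc))), reverse=True
    )
    return ranked[:k]


def build_prompt(question: str, passages: Sequence[str]) -> str:
    """Prepend retrieved passages, if any, to the question."""
    if not passages:
        return f"Question: {question}\nAnswer:"
    evidence = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:"


def evaluate(
    items: Sequence[Tuple[str, str]],
    answer_fn: Callable[[str], str],
    grade_fn: Callable[[str, str], bool],
    corpus: Optional[Sequence[str]] = None,
    k: int = 2,
) -> float:
    """Fraction of (question, reference) pairs the grader marks correct.

    With corpus=None the model answers closed-book; otherwise the top-k
    retrieved passages are added to the prompt.
    """
    correct = 0
    for question, reference in items:
        passages = retrieve(question, corpus, k) if corpus else []
        prediction = answer_fn(build_prompt(question, passages))
        correct += bool(grade_fn(prediction, reference))
    return correct / len(items) if items else 0.0
```

In use, `answer_fn` would wrap a call to the model under test and `grade_fn` would wrap whatever judgment procedure is chosen (expert review, a rubric, or an automated judge); calling `evaluate` once with `corpus=None` and once with a literature corpus gives the closed-book and retrieval-augmented accuracies on the same items.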
Problem

Research questions and friction points this paper is trying to address.

clinical question answering
off-guideline
rare cases
medical LLMs
evidence-based reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

retrieval-augmented reasoning
off-guideline clinical QA
medical case reports
evidence-based LLM evaluation
free-form medical benchmark
Doeun Lee
The Ohio State University
Muge Zhang
The Ohio State University
Yi Yu
The Ohio State University
Ashish Manne
The Ohio State University Wexner Medical Center
Stephen Koesters
The Ohio State University Wexner Medical Center
Frank Wen
University of Chicago Medical Center
Brady Buchanan
The Ohio State University Wexner Medical Center
Lynda Villagomez
The Ohio State University Wexner Medical Center
Oluwatoba Moninuola
The Ohio State University Wexner Medical Center
James Lim
The Ohio State University Wexner Medical Center
Kathryn Tobin
The Ohio State University Wexner Medical Center
Andrew Srisuwananukorn
The Ohio State University Wexner Medical Center
Ping Zhang
The Ohio State University
Data Mining, Deep Learning, Causal AI, Multimodal LLM, AI in Medicine
Sachin Kumar
The Ohio State University
Natural Language Processing, Machine Learning