🤖 AI Summary
This work addresses the lack of rigorous evaluation of reasoning capabilities in reasoning-enhanced large language models (LLMs) within real-world clinical settings. We introduce MedR-Bench, the first fine-grained medical reasoning benchmark, comprising 1,453 structured, real-world cases spanning 13 organ systems and 10 specialty disease categories, designed to systematically assess clinical reasoning across the assessment recommendation, diagnosis, and treatment planning stages. We propose the Reasoning Evaluator, an automated assessment framework that enables dynamic, cross-validated evaluation of free-text reasoning outputs along three dimensions: efficiency, factual consistency, and completeness. Experiments reveal that state-of-the-art reasoning LLMs achieve >85% accuracy on simple diagnostic tasks but degrade significantly on assessment recommendation and treatment planning. Although overall factual consistency exceeds 90%, critical reasoning steps are frequently omitted. This study identifies key bottlenecks in medical reasoning and establishes a paradigm for trustworthy clinical AI evaluation.
📝 Abstract
The latest reasoning-enhanced large language models (reasoning LLMs), such as DeepSeek-R1 and OpenAI-o3, have demonstrated remarkable success. However, the application of such reasoning enhancements to the highly professional medical domain has not been clearly evaluated, particularly with respect to assessing not only their final outputs but also the quality of their reasoning processes. In this study, we present MedR-Bench, a reasoning-focused medical evaluation benchmark comprising 1,453 structured patient cases with reasoning references mined from case reports. Our benchmark spans 13 body systems and 10 specialty disorders, encompassing both common and rare diseases. In our evaluation, we introduce a versatile framework consisting of three critical clinical stages: assessment recommendation, diagnostic decision-making, and treatment planning, comprehensively capturing LLMs' performance across the entire patient journey in healthcare. For metrics, we propose a novel agentic system, the Reasoning Evaluator, designed to automatically and objectively quantify free-text reasoning responses in a scalable manner, scoring efficiency, factuality, and completeness through dynamic searching and cross-referencing checks. With this framework, we assess five state-of-the-art reasoning LLMs, including DeepSeek-R1 and OpenAI-o3-mini. Our results reveal that current LLMs can handle relatively simple diagnostic tasks when sufficient critical assessment results are available, achieving accuracy generally above 85%. However, they still struggle with more complex tasks, such as assessment recommendation and treatment planning. Their reasoning processes are generally reliable, with factuality scores exceeding 90%, though they often omit critical reasoning steps. Our study reveals clear directions for the further development of clinical LLMs.
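To make the three reasoning metrics concrete, here is a minimal sketch of how efficiency, factuality, and completeness could be computed once a free-text reasoning trace has been segmented into discrete steps. All names are illustrative assumptions; the paper's actual Reasoning Evaluator is an agentic system that performs dynamic search and cross-referencing rather than simple set matching.

```python
# Hypothetical sketch of the three reasoning metrics. Assumes reasoning
# traces have already been split into step strings; the real Reasoning
# Evaluator works on free text via an agentic search/cross-check pipeline.

def score_reasoning(model_steps, reference_steps, verified_steps):
    """Score one reasoning trace against its reference.

    model_steps:     steps extracted from the LLM's reasoning output
    reference_steps: gold reasoning steps mined from the case report
    verified_steps:  subset of model_steps judged factually correct
                     (e.g., by cross-referencing retrieved evidence)
    """
    matched = set(model_steps) & set(reference_steps)

    # Efficiency: fraction of generated steps that actually contribute,
    # i.e., align with some reference step.
    efficiency = len(matched) / len(model_steps) if model_steps else 0.0

    # Factuality: fraction of generated steps that pass verification.
    factuality = len(set(verified_steps)) / len(model_steps) if model_steps else 0.0

    # Completeness: fraction of reference steps the model covered.
    completeness = len(matched) / len(reference_steps) if reference_steps else 0.0

    return {"efficiency": efficiency,
            "factuality": factuality,
            "completeness": completeness}
```

Under this toy formulation, a model can score high on factuality (few wrong steps) while scoring low on completeness (many reference steps omitted), which mirrors the pattern the paper reports: factuality above 90% alongside frequently missing critical steps.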