🤖 AI Summary
Existing medical computation benchmarks evaluate only final answers with lenient tolerance thresholds, obscuring systematic deficiencies of large language models (LLMs) in evidence-based medical reasoning and undermining clinical trustworthiness.
Method: We propose a fine-grained, stepwise evaluation framework that reconstructs MedCalc-Bench and introduces structured error attribution analysis. We further design MedRaC—a modular, tuning-free agent pipeline integrating retrieval-augmented generation, Python code execution, and human verification—to enhance interpretability and reliability.
Contribution/Results: Under our framework, GPT-4o’s measured accuracy drops from 62.7% to 43.6%, exposing substantial overestimation in conventional evaluation. MedRaC elevates the multi-model average accuracy from 16.35% to 53.19%, demonstrating strong cross-model generalizability and robustness on complex medical computation tasks.
📝 Abstract
Large language models (LLMs) have demonstrated promising performance on medical benchmarks; however, their ability to perform medical calculations, a crucial aspect of clinical decision-making, remains underexplored and poorly evaluated. Existing benchmarks often assess only the final answer with a wide numerical tolerance, overlooking systematic reasoning failures and potentially causing serious clinical misjudgments. In this work, we revisit medical calculation evaluation with a stronger focus on clinical trustworthiness. First, we clean and restructure the MedCalc-Bench dataset and propose a new step-by-step evaluation pipeline that independently assesses formula selection, entity extraction, and arithmetic computation. Under this granular framework, the accuracy of GPT-4o drops from 62.7% to 43.6%, revealing errors masked by prior evaluations. Second, we introduce an automatic error analysis framework that generates structured attribution for each failure mode. Human evaluation confirms its alignment with expert judgment, enabling scalable and explainable diagnostics. Finally, we propose a modular agentic pipeline, MedRaC, that combines retrieval-augmented generation and Python-based code execution. Without any fine-tuning, MedRaC improves the accuracy of different LLMs from 16.35% up to 53.19%. Our work highlights the limitations of current benchmark practices and proposes a more clinically faithful methodology. By enabling transparent and transferable reasoning evaluation, we move closer to making LLM-based systems trustworthy for real-world medical applications.
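The contrast between lenient final-answer checking and the stepwise grading described above can be sketched in a few lines. The example below is illustrative only: the task (Cockcroft-Gault creatinine clearance), the tolerance value, and all field names are hypothetical assumptions, not taken from MedCalc-Bench or the paper's implementation.

```python
# Hedged sketch: final-answer-only vs. stepwise evaluation of a medical
# calculation. All data and names are illustrative, not from MedCalc-Bench.

def final_answer_correct(pred: float, gold: float, rel_tol: float = 0.05) -> bool:
    """Conventional check: accept any value within a wide tolerance band."""
    return abs(pred - gold) <= rel_tol * abs(gold)

def stepwise_correct(pred_steps: dict, gold_steps: dict) -> dict:
    """Grade formula selection, entity extraction, and arithmetic independently."""
    results = {
        "formula": pred_steps["formula"] == gold_steps["formula"],
        "entities": pred_steps["entities"] == gold_steps["entities"],
        # Arithmetic must match the reference value exactly (up to float noise).
        "arithmetic": abs(pred_steps["value"] - gold_steps["value"]) < 1e-6,
    }
    results["all_steps"] = all(results.values())
    return results

# A model that extracts the wrong weight (an entity-extraction error) can still
# land close enough to pass the lenient check, masking the reasoning failure.
gold = {"formula": "cockcroft_gault",
        "entities": {"age": 60, "weight_kg": 70}, "value": 68.1}
pred = {"formula": "cockcroft_gault",
        "entities": {"age": 60, "weight_kg": 72}, "value": 70.0}

print(final_answer_correct(pred["value"], gold["value"]))   # lenient check passes
print(stepwise_correct(pred, gold)["all_steps"])            # stepwise grading fails
```

Grading each step independently is what lets an error-attribution pass assign a failure to a specific stage rather than a single pass/fail bit.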