How Robust Are Large Language Models for Clinical Numeracy? An Empirical Study on Numerical Reasoning Abilities in Clinical Contexts

📅 2026-04-13

🤖 AI Summary

This work addresses the lack of robust evaluation of large language models (LLMs) on clinical numerical reasoning, in particular on operations beyond arithmetic, such as relational comparison and aggregation, and on robustness across clinical note formats. The authors introduce ClinicNumRobBench, a benchmark of 1,624 context-question instances with ground-truth answers, built from longitudinal MIMIC-IV vital-sign records and a note-style variant derived from the Open Patients dataset. It covers four numerical capabilities: value retrieval, arithmetic computation, relational comparison, and aggregation. To assess format robustness, the benchmark presents the records in three semantically equivalent representations and instantiates queries from 42 question templates. Evaluation of 14 LLMs shows that value retrieval is generally strong, with most models exceeding 85% accuracy, while relational comparison and aggregation remain challenging, with some models scoring below 15%. Fine-tuning on medical data can reduce numeracy relative to base models by over 30%, and performance drops under note-style variation indicate sensitivity to format. The benchmark offers a testbed for clinically reliable numerical reasoning.

📝 Abstract

Large Language Models (LLMs) are increasingly being explored for clinical question answering and decision support, yet safe deployment critically requires reliable handling of patient measurements in heterogeneous clinical notes. Existing evaluations of LLMs for clinical numerical reasoning provide limited operation-level coverage, restricted primarily to arithmetic computation, and rarely assess the robustness of numerical understanding across clinical note formats. We introduce ClinicNumRobBench, a benchmark of 1,624 context-question instances with ground-truth answers that evaluates four main types of clinical numeracy: value retrieval, arithmetic computation, relational comparison, and aggregation. To stress-test robustness, ClinicNumRobBench presents longitudinal MIMIC-IV vital-sign records in three semantically equivalent representations, including a real-world note-style variant derived from the Open Patients dataset, and instantiates queries using 42 question templates. Experiments on 14 LLMs show that value retrieval is generally strong, with most models exceeding 85% accuracy, while relational comparison and aggregation remain challenging, with some models scoring below 15%. Fine-tuning on medical data can reduce numeracy relative to base models by over 30%, and performance drops under note-style variation indicate LLM sensitivity to format. ClinicNumRobBench offers a rigorous testbed for clinically reliable numerical reasoning. Code and data are available at https://github.com/MinhVuong2000/ClinicNumRobBench.
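
To make the four numeracy types and the idea of semantically equivalent representations concrete, here is a minimal Python sketch. It is not the authors' code: the record values, the three renderers (`as_key_value`, `as_table`, `as_note`), the field names, and the example questions are all hypothetical, and the benchmark's actual formats and 42 templates may look quite different. The point is only that the same ground-truth answer should hold whichever rendering a model is shown.

```python
# Hypothetical longitudinal vital-sign record (illustrative values, not taken from MIMIC-IV).
RECORD = [
    {"time": "08:00", "heart_rate": 82, "sbp": 118},
    {"time": "12:00", "heart_rate": 96, "sbp": 124},
    {"time": "16:00", "heart_rate": 90, "sbp": 110},
]


def as_key_value(record):
    """Render one 'key: value' line per measurement time (assumed format)."""
    return "\n".join(
        f"time: {r['time']}; heart_rate: {r['heart_rate']}; sbp: {r['sbp']}"
        for r in record
    )


def as_table(record):
    """Render the same record as a pipe-delimited table (assumed format)."""
    header = "time | heart_rate | sbp"
    rows = [f"{r['time']} | {r['heart_rate']} | {r['sbp']}" for r in record]
    return "\n".join([header, *rows])


def as_note(record):
    """Render the same record as free-text, note-style sentences (assumed format)."""
    return " ".join(
        f"At {r['time']} the heart rate was {r['heart_rate']} bpm "
        f"and systolic BP was {r['sbp']} mmHg."
        for r in record
    )


def value_at(record, time, field):
    """Look up a single measurement by timestamp."""
    return next(r[field] for r in record if r["time"] == time)


def ground_truth(record):
    """Ground-truth answers for one hypothetical question per numeracy type."""
    hr_08 = value_at(record, "08:00", "heart_rate")
    hr_12 = value_at(record, "12:00", "heart_rate")
    return {
        # Value retrieval: "What was the heart rate at 12:00?"
        "retrieval": hr_12,
        # Arithmetic computation: "How much did the heart rate change from 08:00 to 12:00?"
        "arithmetic": hr_12 - hr_08,
        # Relational comparison: "Was systolic BP lower at 16:00 than at 08:00?"
        "comparison": value_at(record, "16:00", "sbp") < value_at(record, "08:00", "sbp"),
        # Aggregation: "What was the highest heart rate recorded?"
        "aggregation": max(r["heart_rate"] for r in record),
    }


if __name__ == "__main__":
    for render in (as_key_value, as_table, as_note):
        print(render(RECORD), end="\n\n")
    print(ground_truth(RECORD))
```

A robustness check in this spirit would pose each question against all three renderings and compare the model's answer with the single entry in `ground_truth(RECORD)`; any disagreement across renderings points to format sensitivity rather than a lack of numerical ability.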

Problem

Research questions and friction points this paper is trying to address.

clinical numeracy

numerical reasoning

large language models

robustness

clinical notes

Innovation

Methods, ideas, or system contributions that make the work stand out.

clinical numeracy

numerical reasoning

robustness benchmark