🤖 AI Summary
Current medical AI benchmarks inadequately assess multi-step action reasoning, particularly threshold judgment, temporal aggregation, and conditional logic. This work identifies three critical error types—retrieval, aggregation, and conditional reasoning—and introduces a clinically interpretable, action-oriented reasoning benchmark constructed from real-world electronic health records (EHRs). By integrating EHR data mining, structured task generation, clinical validation, and automated evaluation, the authors produce 600 diverse tasks. Evaluations reveal that leading large language models (e.g., GPT-4o-mini, Claude 3.5 Sonnet) achieve high retrieval accuracy but exhibit pronounced weaknesses in aggregation (28%–64%) and threshold-based reasoning (32%–38%), exposing a fundamental gap in current medical AI systems’ capacity for complex clinical decision-making.
📝 Abstract
Reliable clinical decision support requires medical AI agents capable of safe, multi-step reasoning over structured electronic health records (EHRs). While large language models (LLMs) show promise in healthcare, existing benchmarks inadequately assess performance on action-based tasks involving threshold evaluation, temporal aggregation, and conditional logic. We introduce ART, an Action-based Reasoning clinical Task benchmark for medical AI agents, which mines real-world EHR data to create challenging tasks targeting known reasoning weaknesses. Through analysis of existing benchmarks, we identify three dominant error categories: retrieval failures, aggregation errors, and conditional logic misjudgments. Our four-stage pipeline -- scenario identification, task generation, quality audit, and evaluation -- produces diverse, clinically validated tasks grounded in real patient data. Evaluating GPT-4o-mini and Claude 3.5 Sonnet on 600 tasks shows near-perfect retrieval after prompt refinement, but substantial gaps in aggregation (28--64%) and threshold reasoning (32--38%). By exposing failure modes in action-oriented EHR reasoning, ART advances toward more reliable clinical agents, an essential step for AI systems that reduce cognitive load and administrative burden and support workforce capacity in high-demand care settings.
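For intuition, the three reasoning operations the abstract names (retrieval, temporal aggregation, and threshold-based conditional logic) can be sketched over structured lab data. This is a minimal, hypothetical illustration only: the `LabResult` schema and helper names are assumptions, not ART's actual task format or EHR schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical lab-record schema; ART's real EHR representation is not
# specified in the abstract.
@dataclass
class LabResult:
    name: str
    value: float
    time: datetime

def latest(labs, name):
    """Retrieval: the most recent value of a named lab, or None."""
    matches = [l for l in labs if l.name == name]
    return max(matches, key=lambda l: l.time).value if matches else None

def window_mean(labs, name, since):
    """Temporal aggregation: mean of a lab over results at or after `since`."""
    vals = [l.value for l in labs if l.name == name and l.time >= since]
    return mean(vals) if vals else None

def exceeds_threshold(labs, name, limit, since):
    """Conditional/threshold logic: does the windowed mean exceed `limit`?"""
    m = window_mean(labs, name, since)
    return m is not None and m > limit

now = datetime(2024, 1, 10)
labs = [
    LabResult("creatinine", 1.1, now - timedelta(days=5)),
    LabResult("creatinine", 1.6, now - timedelta(days=2)),
    LabResult("creatinine", 1.8, now - timedelta(days=1)),
]
print(latest(labs, "creatinine"))                                   # 1.8
print(window_mean(labs, "creatinine", now - timedelta(days=3)))     # 1.7
print(exceeds_threshold(labs, "creatinine", 1.5, now - timedelta(days=3)))  # True
```

Tasks of this shape are trivial for deterministic code but, per the reported results, are exactly where LLM agents degrade: aggregation and threshold steps compound retrieval into multi-step reasoning.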