FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering

📅 2025-09-12
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing benchmarks inadequately assess large language model (LLM) agents on real-world clinical data standardized with HL7 FHIR. Method: FHIR-AgentBench is introduced as the first realistic clinical question-answering benchmark grounded in FHIR, comprising 2,931 clinically grounded questions derived from authentic electronic health record (EHR) scenarios. LLM agents are systematically evaluated within the FHIR resource model, comparing data retrieval via direct API calls vs. domain-specific tools, single-turn vs. multi-turn interaction, and natural-language vs. code-based reasoning. This analysis uncovers core challenges: retrieval ambiguity, cross-resource relational modeling, and multi-step clinical reasoning. Contribution/Results: The dataset and evaluation suite are released open source. Empirical analysis identifies critical performance bottlenecks of current agent approaches on structured clinical queries, advancing reproducible, interoperable research on clinical AI agents.

📝 Abstract
The recent shift toward the Health Level Seven Fast Healthcare Interoperability Resources (HL7 FHIR) standard opens a new frontier for clinical AI, requiring LLM agents to navigate complex, resource-based data models instead of conventional structured health data. However, existing benchmarks have lagged behind this transition, lacking the realism needed to evaluate recent LLMs on interoperable clinical data. To bridge this gap, we introduce FHIR-AgentBench, a benchmark that grounds 2,931 real-world clinical questions in the HL7 FHIR standard. Using this benchmark, we systematically evaluate agentic frameworks, comparing different data retrieval strategies (direct FHIR API calls vs. specialized tools), interaction patterns (single-turn vs. multi-turn), and reasoning strategies (natural language vs. code generation). Our experiments highlight the practical challenges of retrieving data from intricate FHIR resources and the difficulty of reasoning over them, both of which critically affect question answering performance. We publicly release the FHIR-AgentBench dataset and evaluation suite (https://github.com/glee4810/FHIR-AgentBench) to promote reproducible research and the development of robust, reliable LLM agents for clinical applications.
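The "direct FHIR API calls" retrieval strategy mentioned in the abstract amounts to issuing RESTful search queries against a FHIR server. A minimal sketch of how such a query might be constructed; the server URL, patient reference, and LOINC code below are illustrative assumptions, not values from the benchmark:

```python
from urllib.parse import urlencode

def build_fhir_search_url(base_url: str, resource_type: str, **params) -> str:
    """Construct a FHIR RESTful search URL, i.e. GET [base]/[type]?[params]."""
    query = urlencode(sorted(params.items()))
    return f"{base_url.rstrip('/')}/{resource_type}?{query}"

# Hypothetical query: all blood-pressure Observations (LOINC 85354-9)
# for one patient, newest first.
url = build_fhir_search_url(
    "https://fhir.example.org/r4",      # placeholder server, not from the paper
    "Observation",
    patient="Patient/123",
    code="http://loinc.org|85354-9",
    _sort="-date",
)
print(url)
```

An agent following this strategy would issue the resulting GET request and then parse the returned Bundle, which is where the retrieval ambiguity noted in the summary tends to arise (choosing the right resource type, search parameters, and codings for a clinical question).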
Problem

Research questions and friction points this paper is trying to address.

Benchmarking LLM agents for realistic EHR question answering using FHIR standard
Evaluating data retrieval strategies and reasoning methods for clinical interoperability
Addressing challenges in navigating complex FHIR resources for healthcare applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark grounds clinical questions in FHIR standard
Evaluates retrieval strategies and interaction patterns systematically
Assesses reasoning approaches for complex healthcare data
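The cross-resource reasoning the benchmark evaluates (here in the code-generation style) can be illustrated by resolving references inside a FHIR Bundle, e.g. linking Observations back to their Patient. A hedged sketch over toy data: field names follow FHIR R4, but the helper functions and example values are assumptions, not the benchmark's actual tooling:

```python
def index_bundle(bundle: dict) -> dict:
    """Index a Bundle's entries by 'ResourceType/id' for reference resolution."""
    return {
        f"{e['resource']['resourceType']}/{e['resource']['id']}": e["resource"]
        for e in bundle.get("entry", [])
    }

def latest_value(bundle: dict, patient_ref: str, loinc: str):
    """Return the most recent Observation value for one patient and LOINC code."""
    obs = [
        r for r in index_bundle(bundle).values()
        if r["resourceType"] == "Observation"
        and r["subject"]["reference"] == patient_ref
        and any(c.get("code") == loinc for c in r["code"]["coding"])
    ]
    if not obs:
        return None
    # ISO 8601 dates compare correctly as strings.
    newest = max(obs, key=lambda r: r["effectiveDateTime"])
    return newest["valueQuantity"]["value"]

# Toy Bundle with two hemoglobin readings (LOINC 718-7); data is illustrative only.
bundle = {"entry": [
    {"resource": {"resourceType": "Patient", "id": "p1"}},
    {"resource": {"resourceType": "Observation", "id": "o1",
                  "subject": {"reference": "Patient/p1"},
                  "code": {"coding": [{"system": "http://loinc.org", "code": "718-7"}]},
                  "effectiveDateTime": "2024-01-01",
                  "valueQuantity": {"value": 12.1, "unit": "g/dL"}}},
    {"resource": {"resourceType": "Observation", "id": "o2",
                  "subject": {"reference": "Patient/p1"},
                  "code": {"coding": [{"system": "http://loinc.org", "code": "718-7"}]},
                  "effectiveDateTime": "2024-03-05",
                  "valueQuantity": {"value": 13.4, "unit": "g/dL"}}},
]}
print(latest_value(bundle, "Patient/p1", "718-7"))  # -> 13.4
```

Even this small example shows why the paper flags cross-resource relational modeling as a bottleneck: answering one question requires resolving references, filtering by coding system, and ordering by time across separate resources.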
Gyubok Lee
Korea Advanced Institute of Science & Technology, South Korea
Elea Bach
Verily Life Sciences, USA
Eric Yang
AI Scientist, Verily Life Sciences
Tom J. Pollard
Massachusetts Institute of Technology, USA
Alistair Johnson
Unknown affiliation
Machine Learning, Critical Care Medicine
Edward Choi
KAIST
Machine Learning, Artificial Intelligence, Healthcare
Yugang Jia
Verily Life Sciences, USA
Jong Ha Lee
Verily Life Sciences, USA