FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering

📅 2025-09-12
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing benchmarks inadequately assess large language model (LLM) agents on real-world clinical data standardized with HL7 FHIR. Method: FHIR-AgentBench is introduced as the first realistic clinical question-answering benchmark grounded in FHIR, comprising 2,931 clinically grounded questions derived from authentic electronic health record (EHR) scenarios. LLM agents are systematically evaluated within the FHIR resource model, comparing data retrieval via direct API calls vs. domain-specific tools, single-turn vs. multi-turn interaction, and natural-language vs. code-based reasoning. This analysis uncovers core challenges: retrieval ambiguity, cross-resource relational modeling, and multi-step clinical reasoning. Contribution/Results: The dataset and evaluation suite are released open source. Empirical analysis identifies critical performance bottlenecks of current agent approaches on structured clinical queries, advancing reproducible, interoperable research on clinical AI agents.

📝 Abstract
The recent shift toward the Health Level Seven Fast Healthcare Interoperability Resources (HL7 FHIR) standard opens a new frontier for clinical AI, requiring LLM agents to navigate complex, resource-based data models instead of conventional structured health data. However, existing benchmarks have lagged behind this transition, lacking the realism needed to evaluate recent LLMs on interoperable clinical data. To bridge this gap, we introduce FHIR-AgentBench, a benchmark that grounds 2,931 real-world clinical questions in the HL7 FHIR standard. Using this benchmark, we systematically evaluate agentic frameworks, comparing different data retrieval strategies (direct FHIR API calls vs. specialized tools), interaction patterns (single-turn vs. multi-turn), and reasoning strategies (natural language vs. code generation). Our experiments highlight the practical challenges of retrieving data from intricate FHIR resources and the difficulty of reasoning over them, both of which critically affect question answering performance. We publicly release the FHIR-AgentBench dataset and evaluation suite (https://github.com/glee4810/FHIR-AgentBench) to promote reproducible research and the development of robust, reliable LLM agents for clinical applications.
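The "direct FHIR API calls" retrieval strategy mentioned in the abstract amounts to issuing RESTful search queries against a FHIR server. A minimal sketch of how such a query might be constructed; the server URL, patient reference, and LOINC code below are illustrative assumptions, not values from the benchmark:

```python
from urllib.parse import urlencode

def build_fhir_search_url(base_url: str, resource_type: str, **params) -> str:
    """Construct a FHIR RESTful search URL, i.e. GET [base]/[type]?[params]."""
    query = urlencode(sorted(params.items()))
    return f"{base_url.rstrip('/')}/{resource_type}?{query}"

# Hypothetical query: all blood-pressure Observations (LOINC 85354-9)
# for one patient, newest first.
url = build_fhir_search_url(
    "https://fhir.example.org/r4",      # placeholder server, not from the paper
    "Observation",
    patient="Patient/123",
    code="http://loinc.org|85354-9",
    _sort="-date",
)
print(url)
```

An agent following this strategy would issue the resulting GET request and then parse the returned Bundle, which is where the retrieval ambiguity noted in the summary tends to arise (choosing the right resource type, search parameters, and codings for a clinical question).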
Problem

Research questions and friction points this paper is trying to address.

Benchmarking LLM agents for realistic EHR question answering using FHIR standard
Evaluating data retrieval strategies and reasoning methods for clinical interoperability
Addressing challenges in navigating complex FHIR resources for healthcare applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark grounds clinical questions in FHIR standard
Evaluates retrieval strategies and interaction patterns systematically
Assesses reasoning approaches for complex healthcare data
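The cross-resource reasoning the benchmark evaluates (here in the code-generation style) can be illustrated by resolving references inside a FHIR Bundle, e.g. linking Observations back to their Patient. A hedged sketch over toy data: field names follow FHIR R4, but the helper functions and example values are assumptions, not the benchmark's actual tooling:

```python
def index_bundle(bundle: dict) -> dict:
    """Index a Bundle's entries by 'ResourceType/id' for reference resolution."""
    return {
        f"{e['resource']['resourceType']}/{e['resource']['id']}": e["resource"]
        for e in bundle.get("entry", [])
    }

def latest_value(bundle: dict, patient_ref: str, loinc: str):
    """Return the most recent Observation value for one patient and LOINC code."""
    obs = [
        r for r in index_bundle(bundle).values()
        if r["resourceType"] == "Observation"
        and r["subject"]["reference"] == patient_ref
        and any(c.get("code") == loinc for c in r["code"]["coding"])
    ]
    if not obs:
        return None
    # ISO 8601 dates compare correctly as strings.
    newest = max(obs, key=lambda r: r["effectiveDateTime"])
    return newest["valueQuantity"]["value"]

# Toy Bundle with two hemoglobin readings (LOINC 718-7); data is illustrative only.
bundle = {"entry": [
    {"resource": {"resourceType": "Patient", "id": "p1"}},
    {"resource": {"resourceType": "Observation", "id": "o1",
                  "subject": {"reference": "Patient/p1"},
                  "code": {"coding": [{"system": "http://loinc.org", "code": "718-7"}]},
                  "effectiveDateTime": "2024-01-01",
                  "valueQuantity": {"value": 12.1, "unit": "g/dL"}}},
    {"resource": {"resourceType": "Observation", "id": "o2",
                  "subject": {"reference": "Patient/p1"},
                  "code": {"coding": [{"system": "http://loinc.org", "code": "718-7"}]},
                  "effectiveDateTime": "2024-03-05",
                  "valueQuantity": {"value": 13.4, "unit": "g/dL"}}},
]}
print(latest_value(bundle, "Patient/p1", "718-7"))  # -> 13.4
```

Even this small example shows why the paper flags cross-resource relational modeling as a bottleneck: answering one question requires resolving references, filtering by coding system, and ordering by time across separate resources.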
Gyubok Lee
Korea Advanced Institute of Science & Technology, South Korea
Elea Bach
Verily Life Sciences, USA
Eric Yang
AI Scientist, Verily Life Sciences
Tom J. Pollard
Massachusetts Institute of Technology, USA
Alistair Johnson
Unknown affiliation
Machine Learning, Critical Care Medicine
Edward Choi
KAIST
Machine Learning, Artificial Intelligence, Healthcare
Yugang Jia
Verily Life Sciences, USA
Jong Ha Lee
Verily Life Sciences, USA