End-to-End Evaluation and Governance of an EHR-Embedded AI Agent for Clinicians

📅 2026-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of continuous evaluation and governance for deployed clinical AI systems, where point-in-time evaluation alone cannot track performance as the system changes. The authors propose an end-to-end governance framework that integrates rubric validation, live deployment feedback, technical performance monitoring, and cost tracking, with controlled experiments gating each system change before deployment. Applied to Hyperscribe, an EHR-embedded agent that converts ambient audio into structured chart updates, the framework was used to evaluate seven versions: median rubric scores rose from 84% to 95%; across 107 live feedback entries over three months, error reports fell from 79% to 30% of feedback while positive observations rose from 14% to 45%; median processing time per audio segment was 8.1 seconds; and the effective completion rate reached 99.6% after retry mechanisms absorbed transient model errors.

📝 Abstract

Clinical AI systems require not just point-in-time evaluation but continuous governance: the ongoing practice of monitoring, evaluating, iterating, and re-evaluating performance throughout deployment. We present an end-to-end framework of governance that integrates rubric validation, live deployment feedback, technical performance monitoring, and cost tracking, with controlled experimentation gating system changes before deployment. Applied to Hyperscribe, an EHR-embedded agent that converts ambient audio into structured chart updates, twenty clinicians authored 1,646 validated rubrics across 823 cases. Seven Hyperscribe versions were evaluated through controlled experiments, with median scores improving from 84% to 95%. Analysis of 107 live feedback entries over three months showed feedback composition shifting from 79% error reports and 14% positive observations to 30% errors and 45% positive observations as engineering interventions resolved failures. Median processing time per audio segment was 8.1 seconds with a 99.6% effective completion rate after retry mechanisms absorbed transient model errors. These results demonstrate that continuous, multi-channel governance of deployed clinical AI is both achievable and effective.
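
The abstract's "effective completion rate after retry mechanisms absorbed transient model errors" can be illustrated with a minimal Python sketch. It is not Hyperscribe's implementation: the names `TransientModelError`, `make_flaky`, `run_with_retries`, and `completion_rates`, and the three-attempt limit, are all assumptions made for illustration. The idea is that a segment counts toward the effective rate if any attempt succeeds, while the first-attempt rate counts only segments that succeeded without a retry.

```python
from typing import Callable, List, Tuple


class TransientModelError(Exception):
    """Hypothetical error type standing in for a recoverable model failure."""


def make_flaky(failures_before_success: int) -> Callable[[], str]:
    """Build a stand-in task that fails a fixed number of times, then succeeds."""
    remaining = {"failures": failures_before_success}

    def task() -> str:
        if remaining["failures"] > 0:
            remaining["failures"] -= 1
            raise TransientModelError("simulated transient failure")
        return "ok"

    return task


def run_with_retries(task: Callable[[], str], max_attempts: int = 3) -> Tuple[bool, int]:
    """Call task until it succeeds or attempts run out.

    Returns (succeeded, attempts_used). A production version would likely
    add a backoff delay between attempts; it is omitted here for brevity.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            task()
            return True, attempt
        except TransientModelError:
            continue
    return False, max_attempts


def completion_rates(outcomes: List[Tuple[bool, int]]) -> Tuple[float, float]:
    """Return (first_attempt_rate, effective_rate) for a non-empty outcome list."""
    total = len(outcomes)
    first_attempt = sum(1 for ok, attempts in outcomes if ok and attempts == 1)
    effective = sum(1 for ok, _ in outcomes if ok)
    return first_attempt / total, effective / total


# Four simulated segments: two succeed immediately, one succeeds on the
# second attempt, and one keeps failing past the attempt limit.
outcomes = [run_with_retries(make_flaky(k)) for k in (0, 0, 1, 5)]
first_rate, effective_rate = completion_rates(outcomes)
```

In this toy run the gap between `first_rate` and `effective_rate` is the share of segments that retries recovered. The counts are simulated and carry no relation to the reported 99.6%.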

Problem

Research questions and friction points this paper is trying to address.

clinical AI governance

continuous evaluation

EHR-embedded AI

deployment monitoring

AI performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

continuous governance

EHR-embedded AI

controlled experimentation