Retrieval-Augmented Guardrails for AI-Drafted Patient-Portal Messages: Error Taxonomy Construction and Large-Scale Evaluation

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses clinical errors, information omissions, and inappropriate tone in large language model (LLM)-assisted drafting of replies to patient-clinician portal messages. We propose the first fine-grained clinical error ontology for this domain, comprising 5 domains and 59 granular error codes, and introduce RAEC, a retrieval-augmented evaluation pipeline. RAEC employs a two-stage DSPy prompting architecture that semantically retrieves analogous message-response pairs from institutional historical electronic health records to enable context-aware assessment of AI-generated drafts. Evaluated on over 1,500 real-world messages, RAEC raises concordance with human labels from 33% to 50% and achieves an F1 of 0.500 versus 0.256 for the context-free baseline. This work establishes the first clinical error taxonomy designed specifically for AI-drafted portal-message replies and empirically validates a historical-data-driven, interpretable, and high-agreement evaluation paradigm.

📝 Abstract
Asynchronous patient-clinician messaging via EHR portals is a growing source of clinician workload, prompting interest in large language models (LLMs) to assist with draft responses. However, LLM outputs may contain clinical inaccuracies, omissions, or tone mismatches, making robust evaluation essential. Our contributions are threefold: (1) we introduce a clinically grounded error ontology comprising 5 domains and 59 granular error codes, developed through inductive coding and expert adjudication; (2) we develop a retrieval-augmented evaluation pipeline (RAEC) that leverages semantically similar historical message-response pairs to improve judgment quality; and (3) we provide a two-stage prompting architecture using DSPy to enable scalable, interpretable, and hierarchical error detection. Our approach assesses the quality of drafts both in isolation and with reference to similar past message-response pairs retrieved from institutional archives. Using a two-stage DSPy pipeline, we compared baseline and reference-enhanced evaluations on over 1,500 patient messages. Retrieval context improved error identification in domains such as clinical completeness and workflow appropriateness. Human validation on 100 messages demonstrated superior agreement (concordance = 50% vs. 33%) and performance (F1 = 0.500 vs. 0.256) of context-enhanced labels vs. baseline, supporting the use of our RAEC pipeline as AI guardrails for patient messaging.
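The abstract's two-stage DSPy architecture maps naturally onto DSPy's signature/module abstractions. Below is a minimal sketch of how such a pipeline might look; the signature names, field names, and exact stage boundaries (DetectErrorDomains, AssignErrorCodes, RAECJudge) are illustrative assumptions, not the authors' implementation.

```python
import dspy

class DetectErrorDomains(dspy.Signature):
    """Stage 1: decide which of the 5 error domains (if any) apply to a draft reply."""
    patient_message = dspy.InputField(desc="the patient's portal message")
    draft_reply = dspy.InputField(desc="the LLM-drafted clinician response")
    similar_pairs = dspy.InputField(desc="retrieved historical message-response pairs")
    flagged_domains = dspy.OutputField(desc="comma-separated error domains, or 'none'")

class AssignErrorCodes(dspy.Signature):
    """Stage 2: assign granular error codes within each flagged domain."""
    patient_message = dspy.InputField()
    draft_reply = dspy.InputField()
    flagged_domains = dspy.InputField()
    error_codes = dspy.OutputField(desc="granular codes from the 59-code ontology")

class RAECJudge(dspy.Module):
    """Two-stage, retrieval-augmented error detector for draft portal replies."""
    def __init__(self):
        super().__init__()
        self.detect = dspy.ChainOfThought(DetectErrorDomains)
        self.assign = dspy.ChainOfThought(AssignErrorCodes)

    def forward(self, patient_message, draft_reply, similar_pairs):
        # First pass: coarse, domain-level screening with retrieved context.
        stage1 = self.detect(patient_message=patient_message,
                             draft_reply=draft_reply,
                             similar_pairs=similar_pairs)
        # Second pass: fine-grained codes only for the flagged domains.
        stage2 = self.assign(patient_message=patient_message,
                             draft_reply=draft_reply,
                             flagged_domains=stage1.flagged_domains)
        return dspy.Prediction(domains=stage1.flagged_domains,
                               codes=stage2.error_codes)
```

The hierarchical split mirrors the paper's description of scalable, hierarchical error detection: a cheap domain-level pass runs first, and code-level labeling happens only for flagged domains, keeping the full 59-code ontology out of the first-stage prompt.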
Problem

Research questions and friction points this paper is trying to address.

Detecting clinical inaccuracies in AI-drafted patient messages
Evaluating workflow appropriateness of automated message responses
Improving error identification using retrieval-augmented evaluation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Clinically grounded error ontology with granular codes
Retrieval-augmented evaluation using historical message pairs (see the retrieval sketch after this list)
Two-stage DSPy pipeline for hierarchical error detection
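As a companion to the pipeline sketch above, here is one plausible way to implement the semantic retrieval step with off-the-shelf sentence embeddings. The embedding model, the in-memory archive, and the retrieve_similar helper are illustrative assumptions; the paper retrieves from institutional EHR archives whose tooling is not described on this page.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical historical archive of (patient_message, clinician_response) pairs.
archive = [
    ("My blood pressure readings have been high all week.",
     "Please log readings twice daily and send them in; we may adjust your dose."),
    ("Can I get a refill on my metformin?",
     "I've sent the refill to your pharmacy; it should be ready this afternoon."),
]

# Embedding model choice is an assumption, not the paper's.
model = SentenceTransformer("all-MiniLM-L6-v2")
archive_vecs = model.encode([msg for msg, _ in archive], normalize_embeddings=True)

def retrieve_similar(new_message: str, k: int = 2):
    """Return the k historical pairs whose patient messages are most similar."""
    query = model.encode([new_message], normalize_embeddings=True)[0]
    scores = archive_vecs @ query  # cosine similarity (vectors are unit-normalized)
    top = np.argsort(-scores)[:k]
    return [archive[i] for i in top]

pairs = retrieve_similar("I need my diabetes medication refilled.")
```

The retrieved pairs would then be serialized into the similar_pairs field of the stage-1 prompt, grounding the judge in how clinicians at the same institution actually handled comparable messages.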
👥 Authors

Wenyuan Chen
University of Toronto
Computer vision, Robotics, Medical Imaging, Deep learning

Fateme Nateghi Haredasht
Stanford Center for Biomedical Informatics Research, Stanford, CA, USA

Kameron C Black
Department of Medicine, Stanford University, Stanford, CA, USA

Francois Grolleau
Stanford Center for Biomedical Informatics Research, Stanford, CA, USA

Emily Alsentzer
Assistant Professor, Stanford University
Machine learning for healthcare

Jonathan H. Chen
Stanford Center for Biomedical Informatics Research, Stanford, CA, USA; Department of Medicine, Stanford University, Stanford, CA, USA; Stanford Clinical Excellence Research Center, Stanford University, Stanford, CA, USA

Stephen P Ma
Department of Medicine, Stanford University, Stanford, CA, USA