Retrieval-Augmented Guardrails for AI-Drafted Patient-Portal Messages: Error Taxonomy Construction and Large-Scale Evaluation

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses clinical errors, information omissions, and inappropriate tone in large language model (LLM)-assisted drafting of replies to patient-clinician portal messages. We propose the first fine-grained clinical error ontology for this domain, comprising 5 domains and 59 granular error codes, and introduce RAEC, a retrieval-augmented evaluation pipeline. RAEC employs a two-stage DSPy prompting architecture that semantically retrieves analogous message-response pairs from institutional historical electronic health records to enable context-aware assessment of AI-generated drafts. Evaluated on over 1,500 real-world messages, RAEC raises concordance with human labels from 33% to 50% and achieves an F1 of 0.500 versus 0.256 for the context-free baseline. This work establishes the first clinical error taxonomy designed specifically for AI-drafted portal-message replies and empirically validates a historical-data-driven, interpretable, and high-agreement evaluation paradigm.

📝 Abstract
Asynchronous patient-clinician messaging via EHR portals is a growing source of clinician workload, prompting interest in large language models (LLMs) to assist with draft responses. However, LLM outputs may contain clinical inaccuracies, omissions, or tone mismatches, making robust evaluation essential. Our contributions are threefold: (1) we introduce a clinically grounded error ontology comprising 5 domains and 59 granular error codes, developed through inductive coding and expert adjudication; (2) we develop a retrieval-augmented evaluation pipeline (RAEC) that leverages semantically similar historical message-response pairs to improve judgment quality; and (3) we provide a two-stage prompting architecture using DSPy to enable scalable, interpretable, and hierarchical error detection. Our approach assesses the quality of drafts both in isolation and with reference to similar past message-response pairs retrieved from institutional archives. Using a two-stage DSPy pipeline, we compared baseline and reference-enhanced evaluations on over 1,500 patient messages. Retrieval context improved error identification in domains such as clinical completeness and workflow appropriateness. Human validation on 100 messages demonstrated superior agreement (concordance = 50% vs. 33%) and performance (F1 = 0.500 vs. 0.256) of context-enhanced labels vs. baseline, supporting the use of our RAEC pipeline as AI guardrails for patient messaging.
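The abstract's two-stage DSPy architecture maps naturally onto DSPy's signature/module abstractions. Below is a minimal sketch of how such a pipeline might look; the signature names, field names, and exact stage boundaries (DetectErrorDomains, AssignErrorCodes, RAECJudge) are illustrative assumptions, not the authors' implementation.

```python
import dspy

class DetectErrorDomains(dspy.Signature):
    """Stage 1: decide which of the 5 error domains (if any) apply to a draft reply."""
    patient_message = dspy.InputField(desc="the patient's portal message")
    draft_reply = dspy.InputField(desc="the LLM-drafted clinician response")
    similar_pairs = dspy.InputField(desc="retrieved historical message-response pairs")
    flagged_domains = dspy.OutputField(desc="comma-separated error domains, or 'none'")

class AssignErrorCodes(dspy.Signature):
    """Stage 2: assign granular error codes within each flagged domain."""
    patient_message = dspy.InputField()
    draft_reply = dspy.InputField()
    flagged_domains = dspy.InputField()
    error_codes = dspy.OutputField(desc="granular codes from the 59-code ontology")

class RAECJudge(dspy.Module):
    """Two-stage, retrieval-augmented error detector for draft portal replies."""
    def __init__(self):
        super().__init__()
        self.detect = dspy.ChainOfThought(DetectErrorDomains)
        self.assign = dspy.ChainOfThought(AssignErrorCodes)

    def forward(self, patient_message, draft_reply, similar_pairs):
        # First pass: coarse, domain-level screening with retrieved context.
        stage1 = self.detect(patient_message=patient_message,
                             draft_reply=draft_reply,
                             similar_pairs=similar_pairs)
        # Second pass: fine-grained codes only for the flagged domains.
        stage2 = self.assign(patient_message=patient_message,
                             draft_reply=draft_reply,
                             flagged_domains=stage1.flagged_domains)
        return dspy.Prediction(domains=stage1.flagged_domains,
                               codes=stage2.error_codes)
```

The hierarchical split mirrors the paper's description of scalable, hierarchical error detection: a cheap domain-level pass runs first, and code-level labeling happens only for flagged domains, keeping the full 59-code ontology out of the first-stage prompt.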
Problem

Research questions and friction points this paper is trying to address.

Detecting clinical inaccuracies in AI-drafted patient messages
Evaluating workflow appropriateness of automated message responses
Improving error identification using retrieval-augmented evaluation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Clinically grounded error ontology with granular codes
Retrieval-augmented evaluation using historical message pairs (see the retrieval sketch after this list)
Two-stage DSPy pipeline for hierarchical error detection
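As a companion to the pipeline sketch above, here is one plausible way to implement the semantic retrieval step with off-the-shelf sentence embeddings. The embedding model, the in-memory archive, and the retrieve_similar helper are illustrative assumptions; the paper retrieves from institutional EHR archives whose tooling is not described on this page.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical historical archive of (patient_message, clinician_response) pairs.
archive = [
    ("My blood pressure readings have been high all week.",
     "Please log readings twice daily and send them in; we may adjust your dose."),
    ("Can I get a refill on my metformin?",
     "I've sent the refill to your pharmacy; it should be ready this afternoon."),
]

# Embedding model choice is an assumption, not the paper's.
model = SentenceTransformer("all-MiniLM-L6-v2")
archive_vecs = model.encode([msg for msg, _ in archive], normalize_embeddings=True)

def retrieve_similar(new_message: str, k: int = 2):
    """Return the k historical pairs whose patient messages are most similar."""
    query = model.encode([new_message], normalize_embeddings=True)[0]
    scores = archive_vecs @ query  # cosine similarity (vectors are unit-normalized)
    top = np.argsort(-scores)[:k]
    return [archive[i] for i in top]

pairs = retrieve_similar("I need my diabetes medication refilled.")
```

The retrieved pairs would then be serialized into the similar_pairs field of the stage-1 prompt, grounding the judge in how clinicians at the same institution actually handled comparable messages.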
👥 Authors

Wenyuan Chen
University of Toronto
Computer vision, Robotics, Medical Imaging, Deep learning

Fateme Nateghi Haredasht
Stanford Center for Biomedical Informatics Research, Stanford, CA, USA

Kameron C Black
Department of Medicine, Stanford University, Stanford, CA, USA

Francois Grolleau
Stanford Center for Biomedical Informatics Research, Stanford, CA, USA

Emily Alsentzer
Assistant Professor, Stanford University
Machine learning for healthcare

Jonathan H. Chen
Stanford Center for Biomedical Informatics Research, Stanford, CA, USA; Department of Medicine, Stanford University, Stanford, CA, USA; Stanford Clinical Excellence Research Center, Stanford University, Stanford, CA, USA

Stephen P Ma
Department of Medicine, Stanford University, Stanford, CA, USA