🤖 AI Summary
This work addresses a clinical safety risk in persistent health coaching agents: conventional agent memory overwrites older facts with the user's latest statement, which is hazardous when patient self-reports (current but prone to recall bias) must be reconciled with electronic health records (EHRs) that are medically validated but often stale. The authors propose a Dual-Stream Memory Architecture that strictly separates the unstructured patient narrative from the structured, FHIR-based clinical record. A dedicated Reconciliation Engine evaluates every extracted memory against the patient's FHIR profile and classifies each discrepancy by type, severity, and the FHIR resources involved. The architecture is evaluated on 26 patients across 675 longitudinal wellness coaching sessions, using a hybrid dataset of real provider-patient transcripts and synthetic, FHIR-grounded clinical scenarios. In isolated testing, the engine detects 84.4% of designed clinical discrepancies with 86.7% safety-critical recall. Evaluating extraction and reconciliation on the same data exposes a 13.6% error cascade, which the authors trace to clinical details lost during memory extraction from unstructured conversation rather than to downstream classification errors. The results indicate that validating patient-reported memories against clinical records is both feasible and necessary for the safe deployment of longitudinal health agents.
📝 Abstract
As Large Language Model (LLM) agents transition from single-session tools to persistent systems managing longitudinal healthcare journeys, their memory architectures face a critical challenge: reconciling two imperfect sources of truth. The patient's evolving self-report is current but prone to recall bias, while the Electronic Health Record (EHR) is medically validated but frequently stale. General-purpose agent memory systems optimize for coherence by overwriting older facts with the user's latest statement, a pattern that risks safety failures when applied to clinical data. We introduce a Dual-Stream Memory Architecture that strictly separates the patient narrative from the structured clinical record (FHIR), governed by a dedicated Reconciliation Engine that evaluates every extracted memory against the patient's FHIR profile and classifies discrepancies by type, severity, and the specific FHIR resources involved. We evaluate this architecture on 26 patients across 675 longitudinal wellness coaching sessions, using a hybrid dataset that interleaves real provider-patient transcripts with synthetic, FHIR-grounded clinical scenarios. In isolated testing, the engine detects 84.4% of designed clinical discrepancies with 86.7% safety-critical recall. By coupling extraction and reconciliation evaluation on the same data, we directly quantify a 13.6% error cascade, tracing the degradation to clinical details lost during memory extraction from unstructured conversation rather than to downstream classification errors. These findings establish that validating patient-reported memories against clinical records is both feasible and necessary for safe deployment of longitudinal health agents.
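To make the dual-stream idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a heavily simplified EHR profile (a list of dicts carrying only `resourceType`, `id`, `name`, and `status`, which is far less than real FHIR JSON), and every class name, discrepancy label (`contradiction`, `missing_in_ehr`), severity level, and mapping below is hypothetical. It illustrates only the two properties the abstract describes: new patient statements are appended to the narrative stream without overwriting the clinical record, and each extracted memory is checked against the FHIR profile and labeled by type, severity, and resources involved.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Severity(Enum):
    # Hypothetical severity scale; the paper's actual levels are not given here.
    MODERATE = 1
    SAFETY_CRITICAL = 2


@dataclass(frozen=True)
class NarrativeMemory:
    """One fact extracted from conversation (patient-narrative stream)."""
    category: str    # e.g. "medication", "allergy"
    subject: str     # e.g. "metformin"
    status: str      # e.g. "active", "stopped"
    session_id: int


@dataclass(frozen=True)
class Discrepancy:
    kind: str                        # hypothetical label
    severity: Severity
    fhir_resources: Tuple[str, ...]  # references such as "MedicationStatement/med-1"
    memory: NarrativeMemory


# Assumed mapping from memory category to FHIR resource type.
CATEGORY_TO_RESOURCE = {
    "medication": "MedicationStatement",
    "allergy": "AllergyIntolerance",
}
# Assumed rule: conflicts in these categories are treated as safety-critical.
SAFETY_CRITICAL_CATEGORIES = {"medication", "allergy"}


def reconcile(memory: NarrativeMemory, ehr_profile: List[dict]) -> Optional[Discrepancy]:
    """Compare one extracted memory with the EHR stream; return a flag or None."""
    resource_type = CATEGORY_TO_RESOURCE.get(memory.category)
    matches = [r for r in ehr_profile
               if r["resourceType"] == resource_type and r["name"] == memory.subject]
    if not matches:
        return Discrepancy("missing_in_ehr", Severity.MODERATE, (), memory)
    conflicting = [r for r in matches if r["status"] != memory.status]
    if not conflicting:
        return None  # narrative and record agree
    refs = tuple(f'{r["resourceType"]}/{r["id"]}' for r in conflicting)
    severity = (Severity.SAFETY_CRITICAL
                if memory.category in SAFETY_CRITICAL_CATEGORIES
                else Severity.MODERATE)
    return Discrepancy("contradiction", severity, refs, memory)


@dataclass
class DualStreamMemory:
    """Keeps the two streams separate; the EHR stream is never overwritten."""
    ehr_profile: List[dict]
    narrative: List[NarrativeMemory] = field(default_factory=list)
    discrepancies: List[Discrepancy] = field(default_factory=list)

    def add_memory(self, memory: NarrativeMemory) -> Optional[Discrepancy]:
        self.narrative.append(memory)  # append only, no overwrite
        flag = reconcile(memory, self.ehr_profile)
        if flag is not None:
            self.discrepancies.append(flag)
        return flag


# Usage: the patient says they stopped a drug the record lists as active.
ehr = [{"resourceType": "MedicationStatement", "id": "med-1",
        "name": "metformin", "status": "active"}]
store = DualStreamMemory(ehr_profile=ehr)
flag = store.add_memory(NarrativeMemory("medication", "metformin", "stopped", session_id=12))
```

In this sketch the patient's statement does not change the record's `status`; the conflict is surfaced as a flagged discrepancy instead, which is the behavior the abstract contrasts with general-purpose memory systems that overwrite older facts.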