MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of evaluation frameworks for memory accuracy, safety, and long-term clinical tracking in healthcare AI agents operating in high-stakes, long-horizon scenarios. The authors propose MedMemoryBench, an agent memory benchmark tailored to personalized healthcare, built with a human-agent collaborative pipeline that synthesizes longitudinal interaction trajectories from clinically grounded, synthetic patient archetypes. They introduce an "evaluate-while-constructing" streaming protocol that assesses memory as it accumulates, and they formalize memory saturation, in which sustained information influx degrades retrieval and reasoning robustness. Benchmarking mainstream architectures reveals severe bottlenecks in complex medical reasoning and noise resilience. The expert-validated dataset comprises approximately 2,000 sessions and 16,000 interaction turns and is intended as a foundation for developing production-ready medical agents.

📝 Abstract

The large-scale deployment of personalized healthcare agents demands memory mechanisms that are exceptionally precise, safe, and capable of long-term clinical tracking. However, existing benchmarks primarily focus on daily open-domain conversations, failing to capture the high-stakes complexity of real-world medical applications. Motivated by the stringent production requirements of an industry-leading health management agent serving tens of millions of active users, we introduce MedMemoryBench. We develop a human-agent collaborative pipeline to synthesize highly realistic, long-horizon medical trajectories based on clinically grounded, synthetic patient archetypes. This process yields a massive, expertly validated dataset comprising approximately 2,000 sessions and 16,000 interaction turns. Crucially, MedMemoryBench departs from traditional static evaluations by pioneering an "evaluate-while-constructing" streaming assessment protocol, which precisely mirrors dynamic memory accumulation in production environments. Furthermore, we formalize and systematically investigate the critical phenomenon of memory saturation, where sustained information influx actively degrades retrieval and reasoning robustness. Comprehensive benchmarking reveals severe bottlenecks in mainstream architectures, particularly concerning complex medical reasoning and noise resilience. By exposing these fundamental flaws, MedMemoryBench establishes a vital foundation for developing robust, production-ready medical agents.
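
The "evaluate-while-constructing" idea can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Probe` and `KeywordMemory` classes, the `stream_evaluate` function, the word-overlap retrieval, and the example turns are all hypothetical, chosen only to show memory writes interleaved with evaluation probes rather than a single evaluation after all sessions are ingested.

```python
from dataclasses import dataclass, field


@dataclass
class Probe:
    """Hypothetical probe: a question asked right after a given session."""
    after_session: int
    question: str
    expected: str  # substring the answer should contain


@dataclass
class KeywordMemory:
    """Toy memory (an assumption): stores raw turns, retrieves by word overlap."""
    turns: list = field(default_factory=list)

    def write(self, turn: str) -> None:
        self.turns.append(turn)

    def answer(self, question: str) -> str:
        q_words = set(question.lower().split())
        # Return the stored turn sharing the most words with the question.
        return max(
            self.turns,
            key=lambda t: len(q_words & set(t.lower().split())),
            default="",
        )


def stream_evaluate(sessions, probes, memory):
    """Interleave memory writes with probes; return (session_index, accuracy) pairs."""
    curve = []
    for idx, session in enumerate(sessions):
        for turn in session:
            memory.write(turn)
        # Evaluate immediately, against the memory state built so far.
        due = [p for p in probes if p.after_session == idx]
        if due:
            hits = sum(
                p.expected.lower() in memory.answer(p.question).lower() for p in due
            )
            curve.append((idx, hits / len(due)))
    return curve


# Invented example data, not drawn from the benchmark.
sessions = [
    ["patient reports penicillin allergy"],
    ["patient started metformin 500 mg daily"],
]
probes = [
    Probe(0, "what allergy does the patient have", "penicillin"),
    Probe(1, "what allergy does the patient have", "penicillin"),
    Probe(1, "which drug was started", "metformin"),
]
curve = stream_evaluate(sessions, probes, KeywordMemory())
```

Re-asking an early probe at later sessions, as the second probe does here, is one way such a harness could surface memory saturation: accuracy on the same question would fall as more turns accumulate in memory.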

Problem

Research questions and friction points this paper is trying to address.

agent memory

personalized healthcare

medical benchmarking

memory saturation

clinical reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

medical agent memory

streaming evaluation

memory saturation