🤖 AI Summary
Existing medical dialogue datasets lack authentic long-term temporal structure, making it difficult to evaluate an agent’s ability to remember and reason over a patient’s longitudinal medical history. To address this gap, this work proposes a knowledge-guided three-stage framework for synthesizing MediLongChat, a high-quality longitudinal medical dialogue dataset with coherent patient histories, and establishes three benchmark tasks (In-dialogue Reasoning, Cross-dialogue Reasoning, and Synthesis Reasoning) for evaluating the memory capabilities of healthcare agents. Data quality is assessed along five dimensions (Faithfulness, Coherence, Diversity, Correctness, and Realism) using a multi-dimensional evaluation strategy that combines vector-based metrics with LLM-as-a-judge assessments. Experimental results show that even state-of-the-art large language models struggle on this benchmark, underscoring both the challenge posed by MediLongChat and the need for specialized medical agents capable of longitudinal reasoning.
📝 Abstract
An effective healthcare agent must be able to recall and reason over a patient's longitudinal medical history. However, the absence of datasets with realistic long-term dialogue timelines limits systematic evaluation. Real clinical text is constrained by privacy and ethics, while existing benchmarks focus on isolated interactions, failing to capture cross-session reasoning. We introduce a framework for synthesizing high-quality, long-term medical dialogues with LLMs. Our approach entails a knowledge-guided decomposition into three stages: constructing synthetic patient profiles with diverse disease and complication trajectories, generating multi-turn dialogues per encounter, and integrating them into a coherent longitudinal history dataset, MediLongChat. We establish three benchmark tasks (In-dialogue Reasoning, Cross-dialogue Reasoning, and Synthesis Reasoning) to evaluate the memory capabilities of healthcare agents. To assess data quality, we introduce a multi-dimensional evaluation framework combining vector-based metrics with LLM-as-a-judge assessments. Specifically, we define three automatic measures (Faithfulness, Coherence, and Diversity) together with two LLM-based evaluations (Correctness and Realism). Benchmark experiments show that even state-of-the-art LLMs struggle with MediLongChat. These findings highlight the benchmark's applicability and underscore the need for tailored methods to advance healthcare agents.