Synthesis and Evaluation of Long-term History-aware Medical Dialogue

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing medical dialogue datasets lack authentic long-term temporal structure, making it difficult to evaluate an agent's ability to remember and reason over a patient's longitudinal medical history. To address this gap, this work proposes a knowledge-guided three-stage framework to synthesize MediLongChat, a high-quality longitudinal medical dialogue dataset with coherent patient histories, and introduces three benchmark tasks (In-dialogue Reasoning, Cross-dialogue Reasoning, and Synthesis Reasoning) that cover reasoning both within and across sessions. Data quality is assessed along five dimensions (Faithfulness, Coherence, Diversity, Correctness, and Realism) using a multi-dimensional evaluation strategy that combines vector-based metrics with LLM-as-a-judge assessments. Experimental results show that even state-of-the-art large language models struggle on this benchmark, underscoring both the challenge posed by MediLongChat and the need for specialized medical agents capable of longitudinal reasoning.

📝 Abstract

An effective healthcare agent must be able to recall and reason over a patient's longitudinal medical history. However, the absence of datasets with realistic long-term dialogue timelines limits systematic evaluation. Real clinical text is constrained by privacy and ethics, while existing benchmarks focus on isolated interactions, failing to capture cross-session reasoning. We introduce a framework for synthesizing high-quality, long-term medical dialogues with LLMs. Our approach entails a knowledge-guided decomposition into three stages: constructing synthetic patient profiles with diverse disease and complication trajectories, generating multi-turn dialogues per encounter, and integrating them into a coherent longitudinal history dataset, MediLongChat. We establish three benchmark tasks (In-dialogue Reasoning, Cross-dialogue Reasoning, and Synthesis Reasoning) to evaluate the memory capabilities of healthcare agents. To assess data quality, we introduce a multi-dimensional evaluation framework combining vector-based metrics with LLM-as-a-judge assessments. Specifically, we define three automatic measures (Faithfulness, Coherence, and Diversity) together with two LLM-based evaluations: Correctness and Realism. Benchmark experiments show that even state-of-the-art LLMs struggle with MediLongChat. These findings highlight the benchmark's applicability and underscore the need for tailored methods to advance healthcare agents.
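
The three stages named in the abstract (profile construction, per-encounter dialogue generation, and integration into a longitudinal history) can be pictured with a minimal Python sketch. Everything here is an illustrative assumption rather than the paper's implementation: the class names `Turn`, `Encounter`, and `PatientProfile`, the helpers `generate_dialogue`, `build_history`, and `prior_conditions`, and the stubbed dialogue text that stands in for an LLM call.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date


@dataclass
class Turn:
    speaker: str  # hypothetical convention: "patient" or "doctor"
    text: str


@dataclass
class Encounter:
    visit_date: date
    condition: str
    turns: list[Turn] = field(default_factory=list)


@dataclass
class PatientProfile:
    # Stage 1 (assumed shape): a synthetic patient with a dated
    # trajectory of conditions and complications.
    patient_id: str
    trajectory: list[tuple[date, str]]


def generate_dialogue(profile: PatientProfile, visit_date: date, condition: str) -> list[Turn]:
    # Stage 2 placeholder: a real pipeline would prompt an LLM here.
    # The fixed two-turn exchange below is a stub for illustration only.
    return [
        Turn("patient", f"I am here about my {condition}."),
        Turn("doctor", f"Let us review your {condition} and your earlier visits."),
    ]


def build_history(profile: PatientProfile) -> list[Encounter]:
    # Stage 3: attach a dialogue to each visit and order the encounters
    # chronologically so they form one longitudinal history.
    encounters = [
        Encounter(d, c, generate_dialogue(profile, d, c))
        for d, c in profile.trajectory
    ]
    return sorted(encounters, key=lambda e: e.visit_date)


def prior_conditions(history: list[Encounter], index: int) -> list[str]:
    # Conditions recorded before encounter `index`: the kind of context a
    # cross-dialogue question would require the agent to recall.
    return [e.condition for e in history[:index]]
```

For example, calling `build_history` on a profile whose trajectory lists a 2024 complication before a 2023 diagnosis returns the encounters in date order, and `prior_conditions(history, 1)` then yields only the 2023 diagnosis. Keeping encounters as separate dated objects is one way to make the distinction between in-dialogue and cross-dialogue questions explicit.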

Problem

Research questions and friction points this paper is trying to address.

long-term medical dialogue

longitudinal medical history

cross-session reasoning

healthcare agent evaluation

medical dialogue dataset

Innovation

Methods, ideas, or system contributions that make the work stand out.

longitudinal medical dialogue

synthetic data generation

cross-session reasoning