Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scarcity of training data and evaluation benchmarks for long-context audio reasoning, which hinders open-ended long-form audio generation and summarization. The authors propose the first end-to-end, open-source framework that synthesizes triadic medical consultations—comprising patient–clinician dialogues, multi-speaker audio, and structured clinical notes. The pipeline employs a role-playing large language model to generate initial-visit dialogues, which are then rendered into realistic multi-speaker speech incorporating overlapping utterances, pauses, room acoustics, and ambient noise, followed by automatic generation of SOAP-format clinical summaries. The project releases 8,800 synthetic dialogues (totaling 1,300 hours of audio) with corresponding reference summaries, filling a critical gap in medical long-audio datasets. Evaluations demonstrate that a cascaded approach significantly outperforms end-to-end models.
📝 Abstract
Long-context audio reasoning is underserved in both training data and evaluation. Existing benchmarks target short-context tasks, and the open-ended generation tasks most relevant to long-context reasoning pose well-known challenges for automatic evaluation. We propose a synthetic data generation pipeline designed to serve both as a training resource and as a controlled evaluation environment, and instantiate it for first-visit doctor-patient conversations with SOAP note generation as the task. The pipeline has three stages: persona-driven dialogue generation; multi-speaker audio synthesis with overlap/pause modeling, room acoustics, and sound events; and LLM-based reference SOAP note production. All stages are built entirely on open-weight models. We release 8,800 synthetic conversations with 1.3k hours of corresponding audio and reference notes. Evaluating current open-weight systems, we find that cascaded approaches still substantially outperform end-to-end models.
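The three-stage cascade described in the abstract can be sketched as follows. This is a minimal illustrative skeleton, not the authors' implementation: every function body is a hypothetical stub standing in for an open-weight model (a role-playing LLM for stages 1 and 3, a multi-speaker TTS system with overlap, pause, room-acoustics, and sound-event modeling for stage 2).

```python
# Illustrative sketch of the paper's three-stage synthesis pipeline.
# All names and stub bodies here are assumptions for exposition only.
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str  # e.g. "doctor" or "patient"
    text: str


def generate_dialogue(persona: dict) -> list[Turn]:
    """Stage 1: persona-driven dialogue generation.

    In the paper this is a role-playing LLM producing a first-visit
    consultation; here we return a tiny hard-coded exchange."""
    return [
        Turn("doctor", "What brings you in today?"),
        Turn("patient", f"I've had {persona['complaint']} for two days."),
    ]


def synthesize_audio(dialogue: list[Turn]) -> bytes:
    """Stage 2: multi-speaker audio synthesis with overlap/pause
    modeling, room acoustics, and sound events.

    A real system renders a waveform; this stub returns one second of
    silence at a nominal 16 kHz, 8-bit, as a placeholder."""
    return b"\x00" * 16000


def write_soap_note(dialogue: list[Turn]) -> dict:
    """Stage 3: LLM-based reference SOAP note production.

    As a toy heuristic, patient-side utterances feed the Subjective
    section; the other sections are left empty."""
    subjective = " ".join(t.text for t in dialogue if t.speaker == "patient")
    return {"S": subjective, "O": "", "A": "", "P": ""}


def run_pipeline(persona: dict) -> tuple[list[Turn], bytes, dict]:
    """Cascade the three stages: dialogue -> audio -> reference note."""
    dialogue = generate_dialogue(persona)
    audio = synthesize_audio(dialogue)
    note = write_soap_note(dialogue)
    return dialogue, audio, note


dialogue, audio, note = run_pipeline({"complaint": "a sore throat"})
```

The key design point the paper evaluates is exactly this cascaded shape: text is generated first and audio is rendered from it, so reference notes can be produced from clean transcripts rather than from the (noisier) audio itself.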
Problem

Research questions and friction points this paper is trying to address.

long-context audio reasoning
synthetic data generation
automatic evaluation
doctor-patient conversations
audio summarization
Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic data generation
long-context audio
multi-speaker audio synthesis
SOAP note summarization
open-weight models
Authors

Yanis Labrak
Idiap Research Institute, Switzerland

David Grünert
University of Zurich, Switzerland

Séverin Baroudi
Idiap Research Institute, Switzerland; Université de Toulon, Aix Marseille Univ, LIS, CNRS, France

Jiyun Chun
The Ohio State University, USA

Pawel Cyrta
Stenograf, Poland

Sergio Burdisso
Researcher, Idiap Research Institute
artificial intelligence, machine learning, natural language processing

Ahmed Hassoon
Johns Hopkins University Bloomberg School of Public Health, USA

David Liu
Colorado School of Mines, USA

Adam Rothschild
Allegheny Health Network, USA

Reed Van Deusen
University of Pittsburgh Medical Center, USA

Petr Motlicek
Idiap Research Institute
artificial intelligence, speech and signal processing, machine learning

Andrew Perrault
Assistant Professor, Dept. of Computer Science and Engineering
artificial intelligence, game theory, machine learning, optimization

Ricard Marxer
Université de Toulon, Aix Marseille Univ, CNRS, LIS
machine learning, audio processing, robotics, computer vision

Thomas Schaaf
Carnegie Mellon University
speech and language processing, automatic speech recognition, natural language understanding, machine learning, artificial intelligence