Spoken DialogSum: An Emotion-Rich Conversational Dataset for Spoken Dialogue Summarization

📅 2025-12-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing emotion-aware dialogue summarization research is hindered by the scarcity of datasets with aligned speech, factual summaries, and paralinguistic cues. To address this, we introduce the first large-scale triply-aligned dataset (13,460 samples), comprising raw dialogue audio, factual summaries, and emotion-infused summaries, annotated with speaker attributes (age/gender) and fine-grained paralinguistic features (emotion, pitch, speaking rate). We propose a novel controllable speech data construction paradigm: “LLM-based script rewriting + expressive TTS synthesis.” Leveraging this dataset, we design an end-to-end Audio-LLM model. On emotional summarization, it achieves a 28% ROUGE-L improvement over cascaded ASR-LLM systems, demonstrating—for the first time—the critical advantage of end-to-end speech modeling for emotion-aware dialogue summarization.

📝 Abstract
Recent audio language models can follow long conversations. However, research on emotion-aware or spoken dialogue summarization is constrained by the lack of data that links speech, summaries, and paralinguistic cues. We introduce Spoken DialogSum, the first corpus aligning raw conversational audio with factual summaries, emotion-rich summaries, and utterance-level labels for speaker age, gender, and emotion. The dataset is built in two stages: first, an LLM rewrites DialogSum scripts with Switchboard-style fillers and back-channels, then tags each utterance with emotion, pitch, and speaking rate. Second, an expressive TTS engine synthesizes speech from the tagged scripts, aligned with paralinguistic labels. Spoken DialogSum comprises 13,460 emotion-diverse dialogues, each paired with both a factual and an emotion-focused summary. The dataset is available online at https://fatfat-emosum.github.io/EmoDialog-Sum-Audio-Samples/. Baselines show that an Audio-LLM raises emotional-summary ROUGE-L by 28% relative to a cascaded ASR-LLM system, confirming the value of end-to-end speech modeling.
Problem

Research questions and friction points this paper is trying to address.

Lack of emotion-aware spoken dialogue summarization datasets
Need to align speech with summaries and paralinguistic cues
Improving emotional summary quality via end-to-end audio modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM rewrites scripts with fillers and emotion tags
Expressive TTS synthesizes speech from tagged scripts
Audio-LLM improves emotion summary performance end-to-end
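The two-stage construction pipeline named above ("LLM-based script rewriting + expressive TTS synthesis") can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: Stage 1 (the LLM that inserts fillers and tags each utterance with emotion, pitch, and speaking rate) is mocked with a seeded random tagger, and Stage 2 (the expressive TTS engine) is a stub that only records which tags would condition synthesis.

```python
# Hypothetical sketch of the paper's two-stage data-construction pipeline.
# Stage 1 is mocked (a real system would prompt an LLM); Stage 2 is a stub
# standing in for an expressive TTS engine.
import random

FILLERS = ["uh", "um", "you know"]                 # Switchboard-style fillers
EMOTIONS = ["neutral", "happy", "sad", "angry"]


def rewrite_and_tag(utterances, seed=0):
    """Stage 1: insert a filler into each utterance and attach
    paralinguistic tags (emotion, pitch, speaking rate)."""
    rng = random.Random(seed)
    tagged = []
    for speaker, text in utterances:
        tagged.append({
            "speaker": speaker,
            "text": f"{rng.choice(FILLERS)}, {text}",
            "emotion": rng.choice(EMOTIONS),
            "pitch": rng.choice(["low", "mid", "high"]),
            "rate": rng.choice(["slow", "normal", "fast"]),
        })
    return tagged


def synthesize(tagged_script):
    """Stage 2 stub: an expressive TTS engine would render each utterance
    to audio conditioned on its tags; here we only record the conditioning."""
    return [f"audio<{u['speaker']}|{u['emotion']}|{u['pitch']}|{u['rate']}>"
            for u in tagged_script]


script = [("A", "how was the interview?"),
          ("B", "it went better than I expected!")]
tagged = rewrite_and_tag(script)
audio = synthesize(tagged)
```

The resulting (audio, tags, script) triples would then be paired with a factual and an emotion-focused summary to form one dataset sample.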
Yen-Ju Lu
Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA
Kunxiao Gao
Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA
Mingrui Liang
Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA
Helin Wang
Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA
Thomas Thebaud
Assistant Research Scientist, ECE Dept., Johns Hopkins University, Baltimore
Adversarial and Backdoor Attacks · Speech Emotion Recognition · Audio LLMs · Speaker Characterisation
L. Moro-Velázquez
Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA
N. Dehak
Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA
J. Villalba
Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA