DiffCSS: Diverse and Expressive Conversational Speech Synthesis with Diffusion Models

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current conversational speech synthesis (CSS) systems rely predominantly on deterministic modeling, which limits their ability to achieve response diversity, contextual coherence, and expressive emotional prosody at the same time; they also lack large language model (LLM)-driven end-to-end architectures. To address these limitations, the authors propose an LLM-driven, diffusion-model-enhanced, context-aware CSS framework. The method comprises: (1) a diffusion-based stochastic prosody predictor that enables controllable, multimodal, dialogue-conditioned prosody sampling; and (2) a language-model-based, explicitly prosody-controllable end-to-end CSS system. Experiments demonstrate substantial improvements over existing CSS models in speech diversity, contextual consistency, and emotional naturalness. The framework achieves state-of-the-art performance on both objective metrics (e.g., MCD, F0 RMSE) and subjective MOS evaluations, validating its effectiveness in generating high-fidelity, contextually grounded, and emotionally expressive speech.

📝 Abstract
Conversational speech synthesis (CSS) aims to synthesize both contextually appropriate and expressive speech, and considerable efforts have been made to enhance the understanding of conversational context. However, existing CSS systems are limited to deterministic prediction, overlooking the diversity of potential responses. Moreover, they rarely employ language model (LM)-based TTS backbones, limiting the naturalness and quality of synthesized speech. To address these issues, in this paper, we propose DiffCSS, an innovative CSS framework that leverages diffusion models and an LM-based TTS backbone to generate diverse, expressive, and contextually coherent speech. A diffusion-based context-aware prosody predictor is proposed to sample diverse prosody embeddings conditioned on multimodal conversational context. Then a prosody-controllable LM-based TTS backbone is developed to synthesize high-quality speech with the sampled prosody embeddings. Experimental results demonstrate that the synthesized speech from DiffCSS is more diverse, contextually coherent, and expressive than that of existing CSS systems.
Problem

Research questions and friction points this paper is trying to address.

Enhance conversational speech diversity
Improve speech naturalness and quality
Generate contextually coherent expressive speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion models for diverse prosody
LM-based TTS for natural speech
Context-aware multimodal prosody predictor
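The key idea behind the first two innovations can be illustrated with a toy sketch: a diffusion model samples a prosody embedding by starting from Gaussian noise and iteratively denoising it, conditioned on a dialogue-context vector, so different random draws yield different (diverse) prosody samples for the same context. The code below is an illustrative DDPM-style ancestral sampler, not the paper's actual model; the `dummy_denoiser`, embedding dimension, and noise schedule are all hypothetical placeholders standing in for the learned context-conditioned network.

```python
import numpy as np

def make_schedule(T=50, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (a common DDPM default, assumed here)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def sample_prosody(denoise_fn, context, dim=16, T=50, rng=None):
    """DDPM-style ancestral sampling: start from x_T ~ N(0, I) and
    iteratively denoise, conditioned on the dialogue-context vector.
    Stochasticity at each step is what produces response diversity."""
    if rng is None:
        rng = np.random.default_rng()
    betas, alphas, alpha_bars = make_schedule(T)
    x = rng.standard_normal(dim)  # pure noise at t = T
    for t in reversed(range(T)):
        eps_hat = denoise_fn(x, t, context)  # predicted noise
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(dim) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

def dummy_denoiser(x, t, context):
    """Placeholder for the learned context-aware prosody predictor:
    treats the displacement from the context vector as 'noise'."""
    return x - context

context = np.full(16, 0.5)  # stand-in for a multimodal context embedding
a = sample_prosody(dummy_denoiser, context, rng=np.random.default_rng(0))
b = sample_prosody(dummy_denoiser, context, rng=np.random.default_rng(1))
# Different seeds give different prosody embeddings for the same context,
# which a prosody-controllable TTS backbone would then render as speech.
```

In the full system, each sampled embedding would be handed to the prosody-controllable LM-based TTS backbone, so diversity in the embedding space translates into diversity in the synthesized speech.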
Weihao Wu
Tsinghua University
Zhiwei Lin
Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Yixuan Zhou
Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Jingbei Li
Tsinghua University
Rui Niu
Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; Tencent Youtu Lab; StepFun
Qinghua Wu
Tencent Youtu Lab
Artificial Intelligence / Speech / ACG
Songjun Cao
Tencent
speech understanding, speech generation, multi-modal, LLM
Long Ma
Dalian University of Technology
Computer Vision, Image Processing
Zhiyong Wu
Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; The Chinese University of Hong Kong, Hong Kong SAR, China