SingingSDS: A Singing-Capable Spoken Dialogue System for Conversational Roleplay Applications

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Contemporary spoken dialogue systems (SDS) are limited to speech-only responses, which fall short of the affective, memorable, and entertaining qualities needed for roleplay and interactive entertainment. To address this limitation, the authors propose SingingSDS, presented as the first end-to-end dialogue system that responds through singing, with support for multiple character personas, voice timbres, and melody sources. The system adopts a modular cascaded ASR–LLM–SVS architecture that integrates automatic speech recognition, large language models, and singing voice synthesis, with configurations that trade off response latency, audio quality, and musical style controllability. All source code is released along with a plug-and-play web demo. This work expands the expressive and emotional range of voice interaction, pointing toward more immersive human–machine interaction.

📝 Abstract
With recent advances in automatic speech recognition (ASR), large language models (LLMs), and text-to-speech (TTS) technologies, spoken dialogue systems (SDS) have become widely accessible. However, most existing SDS are limited to conventional spoken responses. We present SingingSDS, a cascaded SDS that responds through singing rather than speaking, fostering more affective, memorable, and pleasurable interactions in character-based roleplay and interactive entertainment scenarios. SingingSDS employs a modular ASR-LLM-SVS pipeline and supports a wide range of configurations across character personas, ASR and LLM backends, SVS models, melody sources, and voice profiles, tailored to different needs in terms of latency, quality, and musical style. SingingSDS is available as a plug-and-play web demo, featuring modular, open-source code that supports customization and extension. Demo: https://huggingface.co/spaces/espnet/SingingSDS. Code: https://github.com/SingingSDS/SingingSDS.
Problem

Research questions and friction points this paper is trying to address.

Developing a spoken dialogue system that responds through singing instead of speaking
Enhancing affective and memorable interactions in roleplay and entertainment scenarios
Providing a modular pipeline that supports diverse configurations for singing responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Employs modular ASR-LLM-SVS pipeline for singing responses
Supports configurable character personas and voice profiles
Provides plug-and-play web demo with open-source code
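The modular ASR-LLM-SVS cascade described above can be sketched as three pluggable stages, where each backend, persona, and melody source is a configuration choice. This is a minimal illustrative sketch; the class and function names are assumptions for exposition, not the actual SingingSDS API.

```python
from typing import Callable

class SingingDialoguePipeline:
    """Illustrative ASR -> LLM -> SVS cascade.

    Each stage is a pluggable callable, mirroring the modular design
    described in the abstract. Real systems would wire in actual ASR,
    LLM, and singing-voice-synthesis backends here.
    """

    def __init__(self,
                 asr: Callable[[bytes], str],
                 llm: Callable[[str, str], str],
                 svs: Callable[[str, str], bytes],
                 persona: str = "default",
                 melody_source: str = "rule-based"):
        self.asr = asr
        self.llm = llm
        self.svs = svs
        self.persona = persona          # character persona conditioning the reply
        self.melody_source = melody_source  # where the melody comes from

    def respond(self, user_audio: bytes) -> bytes:
        text = self.asr(user_audio)                  # 1) speech -> text
        reply = self.llm(text, self.persona)         # 2) text -> persona-conditioned reply
        return self.svs(reply, self.melody_source)   # 3) reply lyrics -> sung waveform

# Dummy stand-in backends, so the sketch runs end to end:
pipeline = SingingDialoguePipeline(
    asr=lambda audio: audio.decode(),
    llm=lambda text, persona: f"[{persona}] sings: {text}",
    svs=lambda lyrics, melody: lyrics.encode(),
)
print(pipeline.respond(b"hello"))  # prints b'[default] sings: hello'
```

Swapping an ASR, LLM, or SVS backend is then just passing a different callable, which is how a cascade like this can be tailored to different latency, quality, and musical-style needs.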
Authors

Jionghao Han, Carnegie Mellon University
Jiatong Shi, Carnegie Mellon University
Masao Someki, Carnegie Mellon University (Speech processing)
Yuxun Tang, Renmin University of China
Lan Liu, Renmin University of China
Yiwen Zhao, Carnegie Mellon University
Wenhao Feng, State Key Laboratory of Robotics and System, Harbin Institute of Technology (Robotics, Space robotics, Artificial Intelligence)
Shinji Watanabe, Carnegie Mellon University (Speech recognition, Speech processing, Speech enhancement, Speech translation)