UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models

📅 2025-10-26
🤖 AI Summary
Current spoken dialogue models lack fine-grained, multi-dimensional voice style control, hindering natural, human-like interaction. To address this, the authors introduce UltraVoice, the first large-scale dialogue dataset for fine-grained voice style control (830+ hours), covering six dimensions: emotion, speed, volume, accent, language, and composite styles. Fine-tuning leading models such as SLAM-Omni and VocalNet on UltraVoice jointly improves functional capabilities (e.g., reasoning) and expressive ones (multi-dimensional controllable speech generation). Experiments show substantial gains: MOS scores increase by 29.12–42.33% and Instruction Following Rates (IFR) rise by 14.61–40.09 percentage points on UltraVoice's multi-dimensional control tasks; on URO-Bench, average performance improves by 10.84% (Basic setting) and 7.87% (Pro setting). The work is a first systematic step toward human-like expressive capability in spoken dialogue models.

📝 Abstract
Spoken dialogue models currently lack the ability for fine-grained speech style control, a critical capability for human-like interaction that is often overlooked in favor of purely functional capabilities like reasoning and question answering. To address this limitation, we introduce UltraVoice, the first large-scale speech dialogue dataset engineered for fine-grained control over multiple speech styles. Encompassing over 830 hours of speech dialogues, UltraVoice provides instructions across six key stylistic dimensions: emotion, speed, volume, accent, language, and composite styles. Fine-tuning leading models such as SLAM-Omni and VocalNet on UltraVoice significantly enhances their fine-grained speech style controllability without degrading core conversational abilities. Specifically, our fine-tuned models achieve improvements of 29.12–42.33% in Mean Opinion Score (MOS) and 14.61–40.09 percentage points in Instruction Following Rate (IFR) on the multi-dimensional control tasks defined in UltraVoice. Moreover, on the URO-Bench benchmark, our fine-tuned models demonstrate substantial gains in core understanding, reasoning, and conversational abilities, with average improvements of +10.84% on the Basic setting and +7.87% on the Pro setting. Furthermore, the dataset's utility extends to training controllable Text-to-Speech (TTS) models, underscoring its high quality and broad applicability for expressive speech synthesis. The complete dataset and model checkpoints are available at: https://github.com/bigai-nlco/UltraVoice.
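As a rough illustration of the IFR metric reported above, the sketch below computes an instruction following rate as the percentage of responses judged to follow the requested style instruction. This is a hedged reconstruction; the function name, input format, and judging protocol are assumptions, not the paper's exact definitions.

```python
# Hypothetical sketch: IFR as the percentage of responses whose style
# instruction was judged as followed. The paper's actual evaluation
# pipeline (human or automatic judging) may differ.

def instruction_following_rate(judgments):
    """judgments: list of booleans, True if the style instruction was followed.

    Returns the rate as a percentage (0.0-100.0).
    """
    if not judgments:
        return 0.0
    return 100.0 * sum(judgments) / len(judgments)

# Toy example: 3 of 4 responses follow the instruction -> 75.0
print(instruction_following_rate([True, True, False, True]))
```

Reporting the metric in percentage points (as the paper does for its 14.61–40.09 pp gains) makes absolute differences between models directly comparable.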
Problem

Research questions and friction points this paper is trying to address.

Lack of fine-grained speech style control in spoken dialogue models
Limited coverage of stylistic dimensions such as emotion, speed, and accent
Enhancing speech controllability without degrading conversational abilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset for fine-grained speech style control
Fine-tuning models to enhance stylistic controllability without degradation
Training controllable Text-to-Speech models with broad applicability
Authors
Wenming Tu — X-LANCE Lab, Shanghai Jiao Tong University
Guanrou Yang — Shanghai Jiao Tong University
Ruiqi Yan — Shanghai Jiao Tong University
Wenxi Chen — X-LANCE Lab, Shanghai Jiao Tong University
Ziyang Ma — X-LANCE Lab, Shanghai Jiao Tong University
Yipeng Kang — BIGAI
Kai Yu — X-LANCE Lab, Shanghai Jiao Tong University
Xie Chen — X-LANCE Lab, Shanghai Jiao Tong University
Zilong Zheng — State Key Laboratory of General Artificial Intelligence, BIGAI