🤖 AI Summary
Current spoken dialogue models lack fine-grained, multi-dimensional voice style control, hindering natural, human-like interaction. To address this, the authors introduce UltraVoice, the first large-scale speech dialogue dataset for fine-grained voice style control (830+ hours), covering six stylistic dimensions: emotion, speed, volume, accent, language, and composite styles. Fine-tuning leading models such as SLAM-Omni and VocalNet on UltraVoice jointly improves functional capabilities (e.g., reasoning) and expressive ones (multi-dimensional style control) without degrading core conversational abilities. Experiments demonstrate substantial improvements: Mean Opinion Scores (MOS) increase by 29.12-42.33% and Instruction Following Rates (IFR) rise by 14.61-40.09 percentage points on UltraVoice's multi-dimensional control tasks; on URO-Bench, average performance improves by 10.84% (Basic setting) and 7.87% (Pro setting). The dataset also transfers to training controllable Text-to-Speech (TTS) models, marking a systematic step toward human-like expressiveness in spoken dialogue models.
📝 Abstract
Spoken dialogue models currently lack fine-grained speech style control, a capability critical for human-like interaction that is often overlooked in favor of purely functional abilities such as reasoning and question answering. To address this limitation, we introduce UltraVoice, the first large-scale speech dialogue dataset engineered for fine-grained control over multiple speech styles. Encompassing over 830 hours of speech dialogues, UltraVoice provides instructions across six key stylistic dimensions: emotion, speed, volume, accent, language, and composite styles. Fine-tuning leading models such as SLAM-Omni and VocalNet on UltraVoice significantly enhances their fine-grained stylistic controllability without degrading core conversational abilities. Specifically, our fine-tuned models achieve improvements of 29.12-42.33% in Mean Opinion Score (MOS) and 14.61-40.09 percentage points in Instruction Following Rate (IFR) on the multi-dimensional control tasks designed in UltraVoice. Moreover, on the URO-Bench benchmark, our fine-tuned models demonstrate substantial gains in core understanding, reasoning, and conversational abilities, with average improvements of +10.84% in the Basic setting and +7.87% in the Pro setting. Furthermore, the dataset's utility extends to training controllable Text-to-Speech (TTS) models, underscoring its high quality and broad applicability for expressive speech synthesis. The complete dataset and model checkpoints are available at: https://github.com/bigai-nlco/UltraVoice.