🤖 AI Summary
Current spoken dialogue models lack fine-grained, multi-dimensional voice style control, hindering natural, human-like interaction. To address this, the authors introduce UltraVoice, the first large-scale speech dialogue dataset for fine-grained voice style control (830+ hours), covering six stylistic dimensions: emotion, speed, volume, accent, language, and composite styles. Fine-tuning leading models such as SLAM-Omni and VocalNet on UltraVoice jointly improves functional capabilities (e.g., reasoning) and expressive ones (multi-dimensional style control) without degrading core conversational abilities. Experiments demonstrate substantial improvements: Mean Opinion Scores (MOS) increase by 29.12-42.33% and Instruction Following Rates (IFR) rise by 14.61-40.09 percentage points on UltraVoice's multi-dimensional control tasks; on URO-Bench, average performance improves by 10.84% (Basic setting) and 7.87% (Pro setting). The dataset also transfers to training controllable Text-to-Speech (TTS) models, marking a systematic step toward human-like expressiveness in spoken dialogue models.
📝 Abstract
Spoken dialogue models currently lack fine-grained speech style control, a capability critical for human-like interaction that is often overlooked in favor of purely functional abilities such as reasoning and question answering. To address this limitation, we introduce UltraVoice, the first large-scale speech dialogue dataset engineered for fine-grained control over multiple speech styles. Encompassing over 830 hours of speech dialogues, UltraVoice provides instructions across six key stylistic dimensions: emotion, speed, volume, accent, language, and composite styles. Fine-tuning leading models such as SLAM-Omni and VocalNet on UltraVoice significantly enhances their fine-grained stylistic controllability without degrading core conversational abilities. Specifically, our fine-tuned models achieve improvements of 29.12-42.33% in Mean Opinion Score (MOS) and 14.61-40.09 percentage points in Instruction Following Rate (IFR) on the multi-dimensional control tasks designed in UltraVoice. Moreover, on the URO-Bench benchmark, our fine-tuned models demonstrate substantial gains in core understanding, reasoning, and conversational abilities, with average improvements of +10.84% in the Basic setting and +7.87% in the Pro setting. Furthermore, the dataset's utility extends to training controllable Text-to-Speech (TTS) models, underscoring its high quality and broad applicability for expressive speech synthesis. The complete dataset and model checkpoints are available at: https://github.com/bigai-nlco/UltraVoice.