MOSS-VoiceGenerator: Create Realistic Voices with Natural Language Descriptions

📅 2026-03-30
🤖 AI Summary
This work proposes an end-to-end speech generation approach driven by natural language instructions, addressing the limited expressiveness and diversity of current text-to-speech models that predominantly rely on studio-recorded data. By leveraging large-scale movie dialogue corpora and integrating open-source instruction-following architectures with in-the-wild speech, the method enables direct synthesis of highly realistic voices from free-form textual descriptions specifying character traits, personality, and emotional states. Subjective evaluations demonstrate that the proposed model significantly outperforms existing voice design systems in overall audio quality, fidelity to user instructions, and naturalness, marking a substantial advance toward human-like vocal expressivity in synthetic speech.
📝 Abstract
Voice design from natural language aims to generate speaker timbres directly from free-form textual descriptions, allowing users to create voices tailored to specific roles, personalities, and emotions. Such controllable voice creation benefits a wide range of downstream applications, including storytelling, game dubbing, role-play agents, and conversational assistants, making it a significant task for modern Text-to-Speech models. However, existing models are largely trained on carefully recorded studio data, which produces speech that is clean and well-articulated, yet lacks the lived-in qualities of real human voices. To address these limitations, we present MOSS-VoiceGenerator, an open-source instruction-driven voice generation model that creates new timbres directly from natural language prompts. Motivated by the hypothesis that exposure to real-world acoustic variation produces more perceptually natural voices, we train on large-scale expressive speech data sourced from cinematic content. Subjective preference studies demonstrate its superiority in overall performance, instruction-following, and naturalness compared to other voice design models.
Problem

Research questions and friction points this paper is trying to address.

voice design
natural language descriptions
text-to-speech
timbre generation
expressive speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

instruction-driven voice generation
natural language voice design
expressive speech synthesis
real-world acoustic variation
timbre generation
Authors

Kexin Huang
Liwei Fan
Botian Jiang
Yaozhou Jiang
Qian Tu
Jie Zhu
Yuqian Zhang
Yiwei Zhao
Chenchen Yang
Zhaoye Fei, Fudan University (Natural Language Processing)
Shimin Li, Fudan University (Large Language Model, Speech Language Model)
Xiaogui Yang
Qinyuan Cheng
Xipeng Qiu