Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt

📅 2024-03-18
🏛️ North American Chapter of the Association for Computational Linguistics
📈 Citations: 5
Influential: 0
🤖 AI Summary
Existing singing voice synthesis (SVS) methods achieve high audio quality and naturalness but offer no explicit control over style attributes such as singer gender, vocal range, and volume. This work introduces Prompt-Singer, which the authors describe as the first SVS method to control these attributes through natural language prompts. Methodologically: (1) the model is a decoder-only transformer with a multi-scale hierarchy; (2) a range-melody decoupled pitch representation allows text-conditioned control of vocal range while keeping the melody accurate; and (3) the authors explore several experimental settings, including different types of text representations, text encoder fine-tuning, and adding speech data to alleviate data scarcity. Experiments show that the model achieves favorable control ability and audio quality. Audio samples are available on the project page.

📝 Abstract
Recent singing-voice-synthesis (SVS) methods have achieved remarkable audio quality and naturalness, yet they lack the capability to control the style attributes of the synthesized singing explicitly. We propose Prompt-Singer, the first SVS method that enables attribute controlling on singer gender, vocal range and volume with natural language. We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation that enables text-conditioned vocal range control while keeping melodic accuracy. Furthermore, we explore various experiment settings, including different types of text representations, text encoder fine-tuning, and introducing speech data to alleviate data scarcity, aiming to facilitate further research. Experiments show that our model achieves favorable controlling ability and audio quality. Audio samples are available at http://prompt-singer.github.io.
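
The abstract does not say how the range-melody decoupled pitch representation is computed. As a rough illustration only, the sketch below assumes one plausible reading: the mean F0 of the voiced frames stands in for vocal range, and the contour rescaled to a fixed reference mean stands in for melody. The function names, the reference value `REF_MEAN_HZ`, and the convention that unvoiced frames are marked with 0 are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical reference pitch, chosen for illustration only.
REF_MEAN_HZ = 220.0


def decouple_pitch(
    f0_hz: List[float], ref_mean_hz: float = REF_MEAN_HZ
) -> Tuple[float, List[float]]:
    """Split an F0 contour into (vocal-range value, range-normalized melody).

    Frames with f0 <= 0 are treated as unvoiced and kept at 0 (an assumption).
    """
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        raise ValueError("contour has no voiced frames")
    # One scalar summarizing how high or low the singer sings.
    mean_f0 = sum(voiced) / len(voiced)
    # Rescale so every contour has the same mean; the melodic shape is kept.
    scale = ref_mean_hz / mean_f0
    melody = [f * scale if f > 0 else 0.0 for f in f0_hz]
    return mean_f0, melody


def recouple_pitch(
    mean_f0: float, melody: List[float], ref_mean_hz: float = REF_MEAN_HZ
) -> List[float]:
    """Invert decouple_pitch: place a normalized melody at a chosen range."""
    scale = mean_f0 / ref_mean_hz
    return [m * scale if m > 0 else 0.0 for m in melody]


if __name__ == "__main__":
    low = [110.0, 0.0, 123.47, 130.81]   # a phrase sung in a low range
    high = [2.0 * f for f in low]        # the same phrase one octave higher
    range_low, melody_low = decouple_pitch(low)
    range_high, melody_high = decouple_pitch(high)
    print(range_low, range_high)
    print(melody_low)
    print(melody_high)
```

Under this assumption, the same phrase sung an octave apart yields the same normalized melody and differs only in the range value, so a text prompt could in principle steer the range value without disturbing the melody.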
Problem

Research questions and friction points this paper is trying to address.

Singing Voice Synthesis
Style Transfer
Voice Conversion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompt-Singer
Style Control
Speech-to-Song Synthesis