🤖 AI Summary
Existing singing voice synthesis methods struggle to achieve fine-grained, controllable generation of vocal techniques (e.g., falsetto, breathy voice, mixed voice, aspiration), limiting expressiveness and realism. To address this, we propose the first multilingual, multi-technique controllable singing synthesis framework supporting five languages and seven vocal techniques. Our method introduces (i) a novel phoneme-level automatic vocal technique annotation scheme and a natural language prompt-driven technique prediction mechanism; and (ii) the first application of flow matching to multilingual, multi-technique singing synthesis, integrated with a custom technique detection network and joint phoneme–technique modeling. Experiments demonstrate significant improvements over state-of-the-art methods: +0.8 in MOS (audio quality) and +12.3% in technique control accuracy. The code, pre-trained models, and open-source audio samples are publicly released.
📝 Abstract
Singing voice synthesis has made remarkable progress in generating natural and high-quality voices. However, existing methods rarely provide precise control over vocal techniques such as intensity, mixed voice, falsetto, bubble, and breathy tones, thus limiting the expressive potential of synthetic voices. We introduce TechSinger, an advanced system for controllable singing voice synthesis that supports five languages and seven vocal techniques. TechSinger leverages a flow-matching-based generative model to produce singing voices with enhanced expressive control over various techniques. To enhance the diversity of training data, we develop a technique detection model that automatically annotates datasets with phoneme-level technique labels. Additionally, our prompt-based technique prediction model enables users to specify desired vocal attributes through natural language, offering fine-grained control over the synthesized singing. Experimental results demonstrate that TechSinger significantly enhances the expressiveness and realism of synthetic singing voices, outperforming existing methods in terms of audio quality and technique-specific control. Audio samples can be found at https://tech-singer.github.io.