GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks

📅 2024-09-20

🏛️ arXiv.org

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

Existing singing datasets suffer from low audio quality, limited diversity of languages and singers, no multi-technique information, no realistic music scores, and poor task suitability, all of which hinder controllable and personalized singing tasks. To address these limitations, the authors introduce GTSinger, a large, global, free-to-use, high-quality singing corpus designed for all singing tasks. GTSinger comprises 80.59 hours of recorded singing from 20 professional singers across nine languages. It provides controlled comparisons and phoneme-level annotations for six singing techniques, realistic music scores, manual phoneme-to-audio alignments, global style labels, and 16.16 hours of paired speech. The corpus is evaluated on four benchmark tasks: technique-controllable singing voice synthesis, technique recognition, style transfer, and speech-to-singing conversion. The authors describe it as the largest recorded singing dataset, and the data, processing code, and benchmark code are publicly released.

📝 Abstract

The scarcity of high-quality and multi-task singing datasets significantly hinders the development of diverse controllable and personalized singing tasks, as existing singing datasets suffer from low quality, limited diversity of languages and singers, absence of multi-technique information and realistic music scores, and poor task suitability. To tackle these problems, we present GTSinger, a large global, multi-technique, free-to-use, high-quality singing corpus with realistic music scores, designed for all singing tasks, along with its benchmarks. Particularly, (1) we collect 80.59 hours of high-quality singing voices, forming the largest recorded singing dataset; (2) 20 professional singers across nine widely spoken languages offer diverse timbres and styles; (3) we provide controlled comparison and phoneme-level annotations of six commonly used singing techniques, helping technique modeling and control; (4) GTSinger offers realistic music scores, assisting real-world musical composition; (5) singing voices are accompanied by manual phoneme-to-audio alignments, global style labels, and 16.16 hours of paired speech for various singing tasks. Moreover, to facilitate the use of GTSinger, we conduct four benchmark experiments: technique-controllable singing voice synthesis, technique recognition, style transfer, and speech-to-singing conversion. The corpus and demos can be found at http://aaronz345.github.io/GTSingerDemo/. We provide the dataset and the code for processing data and conducting benchmarks at https://huggingface.co/datasets/GTSinger/GTSinger and https://github.com/AaronZ345/GTSinger.
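To make the idea of phoneme-level technique annotation concrete, the following is a minimal sketch of how such labels might be represented and aggregated. The record layout (`PhonemeSegment`), the helper `technique_durations`, and the technique names used here are hypothetical illustrations, not the actual GTSinger file format or label set.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PhonemeSegment:
    """Hypothetical record: one aligned phoneme with optional technique tags."""
    phoneme: str
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    techniques: frozenset = field(default_factory=frozenset)


def technique_durations(segments):
    """Sum the time (in seconds) covered by each technique tag."""
    totals = defaultdict(float)
    for seg in segments:
        for tech in seg.techniques:
            totals[tech] += seg.end - seg.start
    return dict(totals)


# Illustrative data only; the labels are placeholders, not corpus values.
segments = [
    PhonemeSegment("l", 0.00, 0.25),
    PhonemeSegment("a", 0.25, 1.00, frozenset({"vibrato"})),
    PhonemeSegment("a", 1.00, 1.50, frozenset({"vibrato", "falsetto"})),
]
durations = technique_durations(segments)
print(durations)
```

Tagging techniques per phoneme rather than per utterance is what allows a segment to carry several techniques at once, as the third segment does here.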

Problem

Research questions and friction points this paper is trying to address.

Lack of high-quality multi-task singing datasets

Need for diverse language and singer representation

Absence of multi-technique data and realistic scores

Innovation

Methods, ideas, or system contributions that make the work stand out.

Largest recorded singing dataset

Multi-technique annotations provided

Realistic music scores included
