🤖 AI Summary
This study addresses the lack of high-quality datasets that has hindered automatic classification of the four vocal modes—Neutral, Curbing, Overdrive, and Edge—defined by Complete Vocal Technique (CVT), thereby impeding the development of intelligent vocal pedagogy. To bridge this gap, we present the first large-scale, multi-microphone synchronized CVT vocal dataset, comprising 3,752 phonation samples across the full vocal range of four singers (three of whom are professional singers with more than five years of CVT experience), yielding over 13,000 audio recordings. Each sample is annotated independently by three experts, and consensus labels are provided. Baseline experiments using natural data augmentation and a ResNet18 architecture achieve a balanced accuracy of 81.3% under five-fold cross-validation, demonstrating the dataset's efficacy and establishing a reliable benchmark for future research in automatic vocal analysis and intelligent voice training systems.
📝 Abstract
The Complete Vocal Technique (CVT) is a school of singing developed over the past decades by Cathrine Sadolin et al. CVT groups the use of the voice into so-called vocal modes, namely Neutral, Curbing, Overdrive, and Edge. Knowledge of the desired vocal mode can be helpful for singing students, so automatic classification of vocal modes is important for technology-assisted singing teaching. Previous attempts at automatic classification of vocal modes have had limited success, potentially due to a lack of data. We therefore recorded a novel vocal mode dataset consisting of sustained vowels sung by four singers, three of whom are professional singers with more than five years of CVT experience. The dataset covers the entire vocal range of the subjects, totaling 3,752 unique samples. Because each sample was captured with four microphones, which offers a natural form of data augmentation, the dataset comprises more than 13,000 recordings in total. Three CVT-experienced annotators each provided an individual annotation; the merged annotation as well as the three individual annotations are published with the dataset. Additionally, we provide baseline classification results: the best balanced accuracy across a 5-fold cross-validation, 81.3%, was achieved with a ResNet18. The dataset can be downloaded at https://zenodo.org/records/14276415.
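The abstract reports *balanced* accuracy rather than plain accuracy, which matters when the four vocal modes are unevenly represented across a singer's range. Balanced accuracy is the unweighted mean of the per-class recalls, so a classifier cannot score well by favoring a majority mode. A minimal sketch of the metric (the toy labels are illustrative, not taken from the dataset):

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    # Average recall over classes, each class contributing equally.
    return sum(correct[c] / total[c] for c in total) / len(total)

# Toy predictions over the four CVT modes:
y_true = ["Neutral", "Neutral", "Curbing", "Overdrive", "Edge", "Edge"]
y_pred = ["Neutral", "Curbing", "Curbing", "Overdrive", "Edge", "Neutral"]
print(balanced_accuracy(y_true, y_pred))  # → 0.75 (mean of recalls 0.5, 1.0, 1.0, 0.5)
```

This is equivalent to `sklearn.metrics.balanced_accuracy_score` for the multiclass case; the paper's actual evaluation pipeline (ResNet18, 5-fold cross-validation) is not reproduced here.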