voice2mode: Phonation Mode Classification in Singing using Self-Supervised Speech Models

📅 2026-02-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the automatic classification of four vocal modes in singing—breathy, neutral, flow, and pressed phonation—by leveraging self-supervised speech foundation models such as HuBERT and wav2vec 2.0. The proposed approach extracts hierarchical embeddings from early layers of these models, applies global temporal pooling, and employs lightweight classifiers (SVM or XGBoost) to achieve effective vocal mode recognition. This work presents the first empirical validation of the transferability of general-purpose speech foundation models to singing vocal tasks, demonstrating that early-layer embeddings outperform conventional handcrafted acoustic features. On a soprano dataset, the method achieves 95.7% accuracy, surpassing spectral baselines by 12–15% and overcoming longstanding performance limitations of traditional approaches.

📝 Abstract
We present voice2mode, a method for classification of four singing phonation modes (breathy, neutral (modal), flow, and pressed) using embeddings extracted from large self-supervised speech models. Prior work on singing phonation has relied on handcrafted signal features or task-specific neural networks; this work evaluates the transferability of speech foundation models to singing phonation classification. voice2mode extracts layer-wise representations from HuBERT and two wav2vec 2.0 variants, applies global temporal pooling, and classifies the pooled embeddings with lightweight classifiers (SVM, XGBoost). Experiments on a publicly available soprano dataset (763 sustained vowel recordings, four labels) show that foundation-model features substantially outperform conventional spectral baselines (spectrogram, mel-spectrogram, MFCC). HuBERT embeddings obtained from early layers yield the best result (~95.7% accuracy with SVM), an absolute improvement of ~12–15% over the best traditional baseline. We also show layer-wise behaviour: lower layers, which retain acoustic/phonetic detail, are more effective than top layers specialized for Automatic Speech Recognition (ASR).
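The pipeline described in the abstract (frame-level embeddings → global temporal pooling → lightweight classifier) can be sketched as follows. This is a minimal illustration, not the authors' code: synthetic frame-level vectors stand in for real HuBERT layer outputs (extracting those would typically use the HuggingFace `transformers` `HubertModel` with `output_hidden_states=True`), and the class separation, dimensions, and `pool` helper are assumptions made for the toy example.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
LABELS = ["breathy", "neutral", "flow", "pressed"]
DIM = 768          # hidden size of HuBERT-base; illustrative choice
N_PER_CLASS = 40   # synthetic recordings per phonation mode

def pool(frames: np.ndarray) -> np.ndarray:
    """Global temporal pooling: average frame-level embeddings over time,
    turning a variable-length (n_frames, DIM) sequence into one DIM vector."""
    return frames.mean(axis=0)

# Simulate frame-level embeddings for each recording. Each phonation mode
# gets its own class mean so the toy problem is linearly separable; real
# embeddings would come from an early HuBERT/wav2vec 2.0 layer.
X, y = [], []
for label in LABELS:
    centre = rng.normal(size=DIM)
    for _ in range(N_PER_CLASS):
        n_frames = int(rng.integers(50, 150))           # variable clip length
        frames = centre + rng.normal(size=(n_frames, DIM))
        X.append(pool(frames))                          # fixed-size vector per clip
        y.append(label)
X, y = np.asarray(X), np.asarray(y)

# Lightweight classifier on the pooled embeddings, as in the paper (SVM).
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(f"toy accuracy: {acc:.2f}")
```

On this synthetic data the SVM separates the four classes easily; the point of the sketch is only the shape of the pipeline — pooling collapses time so that clips of different lengths all become fixed-size inputs for a classical classifier.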
Problem

Research questions and friction points this paper is trying to address.

phonation mode
singing voice
voice classification
speech foundation models
self-supervised learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised speech models
phonation mode classification
HuBERT
singing voice analysis
transfer learning
Aju Ani Justus
University of Birmingham, School of Computer Science, Birmingham, UK
Ruchit Agrawal
University of Oxford
Machine Learning · Natural Language Processing · AI for Healthcare · Multi-modal Deep Learning
Sudarsana Reddy Kadiri
University of Southern California
Speech Processing · Biomedical Signals · Multimodality · Healthcare Informatics · Deep Learning
Shrikanth Narayanan
University of Southern California, Department of Electrical and Computer Engineering, Los Angeles, USA