Lightweight Self-Supervised Detection of Fundamental Frequency and Accurate Probability of Voicing in Monophonic Music

📅 2026-01-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two weaknesses of existing fundamental frequency (F₀) and voicing estimators: their dependence on large labeled corpora and their degradation under realistic recording artifacts. The authors propose a lightweight, fully self-supervised framework for joint F₀ estimation and voicing inference that trains quickly on a limited amount of unlabeled single-instrument audio. The model applies transposition-equivariant learning to constant-Q transform (CQT) features and uses an EM-style iterative reweighting scheme in which Shift Cross-Entropy (SCE) consistency acts as a reliability signal, down-weighting uninformative noisy or unvoiced frames. The resulting weights serve as confidence scores for pseudo-labeling a separate lightweight voicing classifier, so no manual annotations are needed. Trained on MedleyDB and evaluated against MDB-stem-synth ground truth, the method reaches competitive cross-corpus performance (RPA 95.84, RCA 96.24) and demonstrates cross-instrument generalization.
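For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how raw pitch accuracy (RPA) and raw chroma accuracy (RCA) are conventionally computed, assuming the usual 50-cent tolerance and a reference track in which 0 Hz marks unvoiced frames. The function name and the simplified handling of estimator voicing are assumptions for illustration; the paper's exact evaluation protocol is not described on this page.

```python
import numpy as np

def raw_pitch_and_chroma_accuracy(f0_est_hz, f0_ref_hz, tol_cents=50.0):
    """Return (RPA, RCA) in percent over frames that are voiced in the reference.

    Simplification (assumption): the estimate's own voicing decision is ignored,
    so every reference-voiced frame is scored on its estimated frequency.
    """
    f0_est_hz = np.asarray(f0_est_hz, dtype=float)
    f0_ref_hz = np.asarray(f0_ref_hz, dtype=float)
    voiced = f0_ref_hz > 0
    if not voiced.any():
        return 0.0, 0.0
    # Guard against log2(0) when the estimator outputs 0 Hz on a voiced frame.
    est = np.maximum(f0_est_hz[voiced], 1e-6)
    err_cents = 1200.0 * np.log2(est / f0_ref_hz[voiced])
    # RPA: the estimate must be within the tolerance of the reference pitch.
    rpa = np.mean(np.abs(err_cents) < tol_cents)
    # RCA: fold the error to the nearest octave so octave errors are forgiven.
    folded = err_cents - 1200.0 * np.round(err_cents / 1200.0)
    rca = np.mean(np.abs(folded) < tol_cents)
    return 100.0 * rpa, 100.0 * rca

# Toy example: frame 2 is an octave error, frame 3 is unvoiced in the reference.
ref = [220.0, 440.0, 0.0, 330.0]
est = [220.0, 880.0, 100.0, 331.0]
rpa, rca = raw_pitch_and_chroma_accuracy(est, ref)
```

In this toy case the octave error counts against RPA but not RCA, which is why RCA is never lower than RPA for the same estimate.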

📝 Abstract

Reliable fundamental frequency (F₀) and voicing estimation is essential for neural synthesis, yet many pitch extractors depend on large labeled corpora and degrade under realistic recording artifacts. We propose a lightweight, fully self-supervised framework for joint F₀ estimation and voicing inference, designed for rapid single-instrument training from limited audio. Using transposition-equivariant learning on CQT features, we introduce an EM-style iterative reweighting scheme that uses Shift Cross-Entropy (SCE) consistency as a reliability signal to suppress uninformative noisy/unvoiced frames. The resulting weights provide confidence scores that enable pseudo-labeling for a separate lightweight voicing classifier without manual annotations. Trained on MedleyDB and evaluated on MDB-stem-synth ground truth, our method achieves competitive cross-corpus performance (RPA 95.84, RCA 96.24) and demonstrates cross-instrument generalization.
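The abstract does not give the exact form of the SCE loss or the reweighting rule, so the sketch below is only one plausible reading, not the authors' implementation. It assumes that a model outputs a per-frame distribution over pitch bins, that transposing the CQT input by `k` bins should shift that distribution by `k` bins, and that frames violating this consistency receive low weight. All function names, the exponential weighting, the circular shift, and the thresholds are hypothetical.

```python
import numpy as np

def shift_cross_entropy(p_orig, p_shift, k, eps=1e-9):
    """Hypothetical SCE: cross-entropy between the prediction on the transposed
    frame (p_shift) and the original prediction shifted by k bins.
    np.roll wraps around at the edges, a simplification for this toy example."""
    target = np.roll(p_orig, k, axis=-1)
    return -np.sum(target * np.log(p_shift + eps), axis=-1)

def reliability_weights(sce, temperature=1.0):
    """E-step-like update (assumed form): low SCE gives a weight near 1."""
    w = np.exp(-sce / temperature)
    return w / w.max()

def weighted_loss(sce, weights):
    """M-step-like objective (assumed form): SCE averaged under fixed weights."""
    return np.sum(weights * sce) / np.sum(weights)

def pseudo_labels(weights, hi=0.8, lo=0.2):
    """Confident frames become voiced (1) or unvoiced (0); the rest stay -1."""
    labels = np.full(weights.shape, -1, dtype=int)
    labels[weights >= hi] = 1
    labels[weights <= lo] = 0
    return labels

def peaked(n_bins, peak, mass=0.9):
    """Toy distribution with most of its mass on a single pitch bin."""
    p = np.full(n_bins, (1.0 - mass) / (n_bins - 1))
    p[peak] = mass
    return p

n_bins, k = 12, 2
# Frame 0 behaves equivariantly (peak moves from bin 3 to bin 5);
# frame 1 is flat in both views, as a noisy or unvoiced frame might be.
p_orig = np.stack([peaked(n_bins, 3), np.full(n_bins, 1.0 / n_bins)])
p_shift = np.stack([peaked(n_bins, 5), np.full(n_bins, 1.0 / n_bins)])

sce = shift_cross_entropy(p_orig, p_shift, k)
weights = reliability_weights(sce)
labels = pseudo_labels(weights)
loss = weighted_loss(sce, weights)
```

In a full training loop the two steps would alternate: the weights are recomputed from the current model's SCE values, then the model is updated on the weighted loss with those weights held fixed. The pseudo-labels would feed the separate voicing classifier mentioned in the abstract.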

Problem

Research questions and friction points this paper is trying to address.

fundamental frequency

voicing estimation

self-supervised learning

monophonic music

pitch detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised learning

fundamental frequency estimation

voicing detection