🤖 AI Summary
This work addresses numerical instability and insufficient theoretical grounding in self-supervised learning for monophonic pitch estimation. We propose the first translation-equivariant self-supervised learning framework grounded in optimal transport (OT). Our method formalizes pitch translation invariance as Wasserstein distance minimization between one-dimensional probability distributions, yielding a theoretically rigorous, differentiable, and numerically stable loss function. Coupled with a translation-equivariant neural architecture, the framework enables end-to-end optimization. Crucially, this is the first systematic integration of OT theory into one-dimensional translation-equivariant signal modeling—replacing heuristic contrastive or reconstruction objectives prevalent in prior work. Evaluated on standard monophonic pitch estimation benchmarks, our approach achieves state-of-the-art performance, with marked improvements in training stability and generalization accuracy. These results empirically validate the framework’s theoretical soundness, computational robustness, and practical efficacy.
📝 Abstract
In this paper, we propose an Optimal Transport objective for learning one-dimensional translation-equivariant systems and demonstrate its applicability to single pitch estimation. Our method provides a theoretically grounded, more numerically stable, and simpler alternative for training state-of-the-art self-supervised pitch estimators.