🤖 AI Summary
Beat tracking is inherently ambiguous: a single musical excerpt can support several plausible beat sequences (for example, because listeners differ in how they perceive rhythm), which conventional supervised learning struggles to model. To address this, we propose a knowledge-guided, multi-hypothesis contrastive self-supervised pretraining framework. The approach has three components: (1) a music-theory-informed mechanism that generates multiple plausible beat hypotheses; (2) a knowledge-based scoring function that retains the most plausible hypotheses as positive samples for contrastive learning; and (3) training that keeps the learned representations compatible with several retained hypotheses, so that diversity in human beat perception is modeled during pretraining. After fine-tuning on labeled data, the model outperforms existing methods on standard beat tracking benchmarks. This suggests that combining structured musical knowledge with multi-hypothesis selection benefits representation learning for rhythm analysis.
📝 Abstract
Ambiguities in data and problem constraints can lead to diverse, equally plausible outcomes for a machine learning task. In beat and downbeat tracking, for instance, different listeners may adopt different rhythmic interpretations, none of which is necessarily incorrect. To address this, we propose a contrastive self-supervised pre-training approach that leverages multiple hypotheses about possible positive samples in the data. Our model is trained to learn representations compatible with several of these hypotheses, which are selected with a knowledge-based scoring function so that only the most plausible ones are retained. When fine-tuned on labeled data, our model outperforms existing methods on standard benchmarks, showcasing the advantages of integrating domain knowledge with multi-hypothesis selection, particularly in music representation learning.
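The abstract does not specify how hypotheses are generated, scored, or combined in the loss, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes that a hypothesis is a regular beat grid given by a (period, phase) pair in frames, that the knowledge-based score is simply the mean onset strength on that grid, and that the contrastive objective is an InfoNCE-style loss averaged over the retained hypotheses. All function names (`beat_hypotheses`, `plausibility`, `select_hypotheses`, `multi_hypothesis_loss`) and parameters are hypothetical.

```python
import numpy as np


def beat_hypotheses(n_frames, periods):
    """Enumerate candidate beat grids, one per (period, phase) pair, in frames.

    Assumption: a hypothesis is a perfectly regular grid of frame indices.
    """
    return [
        np.arange(phase, n_frames, period)
        for period in periods
        for phase in range(period)
    ]


def plausibility(onset_strength, beats):
    """Stand-in knowledge-based score: mean onset strength on the beat grid."""
    return float(onset_strength[beats].mean())


def select_hypotheses(onset_strength, hypotheses, k):
    """Keep the k highest-scoring hypotheses, best first."""
    scores = np.array([plausibility(onset_strength, h) for h in hypotheses])
    top = np.argsort(scores)[::-1][:k]
    return [hypotheses[i] for i in top]


def multi_hypothesis_loss(embeddings, hypotheses, temperature=0.1):
    """InfoNCE-style loss averaged over the retained hypotheses.

    Within one hypothesis, frames on the beat grid are mutual positives;
    the softmax for each anchor runs over every other frame.
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)  # a frame is never its own positive

    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = row_max + np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_norm

    losses = []
    for beats in hypotheses:
        mask = np.zeros(sim.shape, dtype=bool)
        mask[np.ix_(beats, beats)] = True  # all beat-to-beat pairs
        np.fill_diagonal(mask, False)      # drop self-pairs
        losses.append(-log_prob[mask].mean())
    return float(np.mean(losses))


# Synthetic example: onsets every 4 frames, starting at frame 1.
n_frames = 32
onset_strength = np.full(n_frames, 0.1)
onset_strength[1::4] = 1.0

candidates = beat_hypotheses(n_frames, periods=(3, 4, 5))
kept = select_hypotheses(onset_strength, candidates, k=2)

# Random vectors stand in for the output of an encoder.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(n_frames, 8))
loss = multi_hypothesis_loss(embeddings, kept)
```

Averaging the loss over several retained grids, rather than committing to the single best one, is what lets the representation stay compatible with more than one rhythmic interpretation. A real system would replace the onset-strength score with richer musical knowledge and compute the loss in an automatic differentiation framework so that the encoder can be trained.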