🤖 AI Summary
Music representation learning remains challenging under data-scarce conditions, e.g., niche traditional genres, non-mainstream styles, and personalized compositions, where labeled audio is extremely limited. Method: We systematically investigate model behavior under limited training data (5 to 8,000 minutes), comparing CNN vs. Transformer architectures, self-supervised vs. supervised pretraining paradigms, and varying input segment durations. Evaluation spans standard MIR benchmark tasks plus a noise-robustness analysis. Contribution/Results: Contrary to the common assumption that these models require substantial training data to perform well, we find that under certain conditions models trained on limited data, and even randomly initialized ones, perform comparably to large-dataset baselines across multiple MIR tasks, while handcrafted features outperform all learned representations on some tasks. These findings offer empirical reference points and methodological guidance for lightweight, low-resource music AI.
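The paper does not include its data-sampling code here; as a rough illustration of how training pools at the stated duration budgets (5 to 8,000 minutes) might be drawn, below is a minimal Python sketch. The `sample_duration_budget` helper and the dummy track catalog are hypothetical, not from the paper.

```python
import random

def sample_duration_budget(tracks, budget_minutes, seed=0):
    """Randomly pick tracks until their total duration reaches the budget.

    `tracks` is a list of (path, duration_minutes) pairs; names here are
    illustrative placeholders.
    """
    rng = random.Random(seed)
    shuffled = tracks[:]
    rng.shuffle(shuffled)
    subset, total = [], 0.0
    for path, minutes in shuffled:
        if total >= budget_minutes:
            break
        subset.append(path)
        total += minutes
    return subset

# Example: draw training pools at several of the budgets studied in the paper.
pool = [(f"track_{i}.wav", 3.5) for i in range(4000)]  # dummy catalog
for budget in (5, 50, 500, 8000):
    picked = sample_duration_budget(pool, budget)
    print(budget, "min ->", len(picked), "tracks")
```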
📝 Abstract
Large deep-learning models for music, including those focused on learning general-purpose music audio representations, are often assumed to require substantial training data to achieve high performance. If true, this would pose challenges in scenarios where audio data or annotations are scarce, such as for underrepresented music traditions, non-popular genres, and personalized music creation and listening. Understanding how these models behave in limited-data scenarios could be crucial for developing techniques to address such cases. In this work, we investigate the behavior of several music audio representation models under limited-data learning regimes. We consider music models with various architectures, training paradigms, and input durations, and train them on data collections ranging from 5 to 8,000 minutes in length. We evaluate the learned representations on various music information retrieval tasks and analyze their robustness to noise. We show that, under certain conditions, representations from limited-data and even random models perform comparably to ones from large-dataset models, though handcrafted features outperform all learned representations in some tasks.
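The abstract does not spell out the evaluation protocol, but a standard way to compare frozen representations on MIR tasks (whether from limited-data, random, or large-dataset models) is a shallow probe trained on top of pre-extracted embeddings. The sketch below uses synthetic stand-in features; with real data, the same probe would be fit on embeddings from each model, or on handcrafted features such as MFCC statistics, and the probe's accuracy used as a proxy for representation quality.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for frozen embeddings and task labels (e.g., 10 genre classes).
# Real embeddings would be extracted once from each music model under study.
X = rng.normal(size=(1000, 512)).astype(np.float32)
y = rng.integers(0, 10, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Shallow linear probe on frozen features: the representation does the work,
# so held-out probe accuracy reflects how much task-relevant information
# the embedding carries.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_tr, y_tr)
print("probe accuracy:", accuracy_score(y_te, probe.predict(X_te)))
```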