🤖 AI Summary
Existing music similarity models produce a single holistic score that conflates multiple musical dimensions, such as melody, rhythm, and timbre, and therefore offer neither controllability nor interpretability. This work proposes MERIT, a framework that achieves strongly disentangled multidimensional music representation learning under real audio conditions for the first time. By leveraging conditional audio generation and source separation techniques, the authors construct training data in which only one musical factor varies at a time, and they introduce a multi-head representation architecture in which each head is constrained to respond to a single musical dimension. Experiments demonstrate that each representation head significantly outperforms random baselines on its target dimension while performing near chance on the others, confirming strong factor disentanglement across both synthetic and real-world audio. This enables fine-grained, interpretable music similarity queries.
📝 Abstract
Current music similarity models typically compute a single, monolithic score, entangling distinct musical dimensions like melody, rhythm, and timbre. This limits user control and interpretability and rules out nuanced, factor-specific queries. We introduce MERIT, a framework for learning disentangled, factor-specific music representations tailored to these three core dimensions. To overcome the lack of isolated musical variations in real-world audio, we propose a novel training strategy that combines conditional audio generation with source-separated stems to strongly encourage single-factor variation in the training data. Our evaluations demonstrate strong factor-wise disentanglement: each representation head responds strongly to its intended perceptual dimension while remaining near chance on the others, a representational property that holds across both the synthetic training domain and independent real-world audio.
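To illustrate what a factor-specific query could look like, here is a minimal sketch in plain Python. It assumes each track has already been mapped to one embedding per factor (melody, rhythm, timbre) and that similarity within a factor is measured by cosine similarity. The function names, the dictionary layout, and the toy vectors are all hypothetical and are not taken from MERIT itself.

```python
import math

# Hypothetical factor names, mirroring the three dimensions named in the abstract.
FACTORS = ("melody", "rhythm", "timbre")


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def factor_similarity(track_a, track_b):
    """Return one similarity score per factor instead of a single holistic score."""
    return {f: cosine(track_a[f], track_b[f]) for f in FACTORS}


def rank_by_factor(query, catalog, factor):
    """Rank catalog tracks by similarity to the query along one chosen factor."""
    scored = [(name, cosine(query[factor], emb[factor])) for name, emb in catalog.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy, hand-made embeddings (assumption: real embeddings would come from the model's heads).
query = {"melody": [1.0, 0.0], "rhythm": [0.0, 1.0], "timbre": [1.0, 1.0]}
catalog = {
    "same_melody_new_timbre": {"melody": [1.0, 0.0], "rhythm": [0.0, 1.0], "timbre": [1.0, -1.0]},
    "same_timbre_new_melody": {"melody": [0.0, 1.0], "rhythm": [0.0, 1.0], "timbre": [1.0, 1.0]},
}

if __name__ == "__main__":
    for factor in FACTORS:
        print(factor, rank_by_factor(query, catalog, factor))
```

With these toy vectors, ranking by melody and ranking by timbre put different tracks first, which is the kind of controllable query a single holistic score cannot express.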