🤖 AI Summary
This paper addresses the lack of reference-free, objective quality assessment methods for generative music accompaniment. We propose COCOLA, an audio contrastive learning framework explicitly designed to evaluate the harmonic and rhythmic coherence between an accompaniment and the rest of a track. Methodologically, we introduce stem-level consistency modeling into contrastive learning, using features obtained via Harmonic-Percussive Separation (HPS) to build audio representations that capture fine-grained harmonic compatibility and rhythmic alignment. Our contributions are threefold: (1) a stem-level contrastive learning paradigm enabling reference-free, objective evaluation; (2) an empirical evaluation of recent music accompaniment generation models, which are difficult to benchmark with established metrics, demonstrating the effectiveness of the proposed coherence score; and (3) the release of model checkpoints trained on four public multi-stem datasets (MUSDB18-HQ, MoisesDB, Slakh2100, and CocoChorales).
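As a rough illustration of the scoring pipeline described above, the sketch below separates two stems into harmonic and percussive components with librosa and scores their coherence as the cosine similarity of learned embeddings. The `CoherenceEncoder` module, the log-mel front-end, and the 16 kHz sample rate are illustrative assumptions, not the paper's actual architecture.

```python
# Hedged sketch of HPS features + embedding-similarity coherence scoring.
# CoherenceEncoder is a hypothetical stand-in for the paper's audio encoder.
import librosa
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

SR = 16_000  # assumed sample rate

class CoherenceEncoder(nn.Module):
    """Toy encoder: mean-pooled linear projection of log-mel frames."""
    def __init__(self, n_mels: int = 64, dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(2 * n_mels, dim)  # harmonic + percussive mels

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, 2 * n_mels) -> (batch, dim), L2-normalized
        return F.normalize(self.proj(feats).mean(dim=1), dim=-1)

def hps_features(y: np.ndarray, n_mels: int = 64) -> torch.Tensor:
    """Split a waveform into harmonic/percussive parts, stack their log-mels."""
    harmonic, percussive = librosa.effects.hpss(y)
    mels = [
        librosa.power_to_db(
            librosa.feature.melspectrogram(y=part, sr=SR, n_mels=n_mels)
        ).T  # (frames, n_mels)
        for part in (harmonic, percussive)
    ]
    return torch.from_numpy(np.concatenate(mels, axis=1)).float().unsqueeze(0)

def coherence_score(encoder: CoherenceEncoder,
                    melody: np.ndarray, accomp: np.ndarray) -> float:
    """Cosine similarity between embeddings of the two stems."""
    with torch.no_grad():
        z_m = encoder(hps_features(melody))
        z_a = encoder(hps_features(accomp))
    return F.cosine_similarity(z_m, z_a).item()
```

A higher score would indicate a generated accompaniment that is more harmonically and rhythmically compatible with the given melody stem, with no reference track required.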
📝 Abstract
We present COCOLA (Coherence-Oriented Contrastive Learning for Audio), a contrastive learning method for musical audio representations that captures the harmonic and rhythmic coherence between samples. Our method operates at the level of the stems composing music tracks and can take as input features obtained via Harmonic-Percussive Separation (HPS). COCOLA allows the objective evaluation of generative models for music accompaniment generation, which are difficult to benchmark with established metrics. In this regard, we evaluate recent music accompaniment generation models, demonstrating the effectiveness of the proposed method. We release the model checkpoints trained on public datasets containing separate stems (MUSDB18-HQ, MoisesDB, Slakh2100, and CocoChorales).
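For intuition about how such a coherence-oriented representation could be trained, here is a minimal InfoNCE-style contrastive loss sketch in which positive pairs are embeddings of stems drawn from the same track and time window, and all other pairs in the batch serve as negatives. The in-batch negative sampling and the temperature value are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def infonce_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                 temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE over a batch of paired stem embeddings.

    z_a, z_b: (batch, dim) L2-normalized embeddings; row i of z_a and
    row i of z_b come from coherent stems of the same track/window,
    while all cross-row pairs act as in-batch negatives.
    """
    logits = z_a @ z_b.t() / temperature           # (batch, batch) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetric cross-entropy: match a -> b and b -> a simultaneously.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Hypothetical usage: embed paired stems, then minimize the loss.
# z_a, z_b = encoder(melody_feats), encoder(accomp_feats)
# loss = infonce_loss(z_a, z_b)
```

Using the rest of the batch as negatives is a common design choice in contrastive audio learning: it avoids explicit negative mining while still pushing apart embeddings of stems that do not belong together.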