🤖 AI Summary
Existing joint audio-text models struggle to model negation semantics in music retrieval, often failing to reliably distinguish the presence or absence of musical attributes (e.g., “with vocals” vs. “without vocals”). To address this, we present the first systematic study of negation-aware modeling in multimodal music retrieval. We train a CLAP model from scratch on the Million Song Dataset, leveraging LP-MusicCaps-MSD captions, and introduce a negation-oriented text augmentation strategy alongside a dissimilarity-based contrastive loss. These components explicitly separate affirmative and negated descriptions in the joint embedding space. Experimental results demonstrate that our approach substantially improves the model’s handling of negation while preserving baseline retrieval performance, yielding notable gains on both negation-oriented retrieval tasks and binary classification benchmarks.
📝 Abstract
Joint audio-text models are widely used for music retrieval, yet they struggle with semantic phenomena such as negation. Negation is fundamental for distinguishing the absence (or presence) of musical elements (e.g., "with vocals" vs. "without vocals"), but current systems fail to represent this reliably. In this work, we investigate and mitigate this limitation by training CLAP models from scratch on the Million Song Dataset with LP-MusicCaps-MSD captions. We introduce negation through text augmentation and a dissimilarity-based contrastive loss, designed to explicitly separate original and negated captions in the joint embedding space. To evaluate progress, we propose two protocols that frame negation modeling as retrieval and binary classification tasks. Experiments demonstrate that both methods, individually and combined, improve negation handling while largely preserving retrieval performance.
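The abstract does not specify the exact form of the dissimilarity-based contrastive loss. As a rough illustration of the idea, the sketch below combines a standard symmetric InfoNCE objective over audio-text pairs with an extra hinge term that penalizes cosine similarity between each original caption embedding and its negated counterpart, pushing the two apart in the joint space. All names and hyperparameters (`negation_aware_loss`, `margin`, `lam`) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise-normalized cosine similarity matrix between two embedding batches."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def info_nce(logits):
    """Cross-entropy where each row's positive is on the diagonal (InfoNCE)."""
    z = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def negation_aware_loss(audio_emb, text_emb, neg_text_emb,
                        temperature=0.07, margin=0.0, lam=1.0):
    """Hypothetical CLAP-style loss with a negation dissimilarity term.

    audio_emb:    (B, D) audio embeddings
    text_emb:     (B, D) embeddings of the original (affirmative) captions
    neg_text_emb: (B, D) embeddings of the negated captions
    """
    # Standard symmetric audio<->text contrastive objective.
    logits = cosine_sim(audio_emb, text_emb) / temperature
    base = 0.5 * (info_nce(logits) + info_nce(logits.T))

    # Dissimilarity term (assumed form): hinge on the cosine similarity between
    # each original caption and its negated version, so similar pairs are penalized.
    pair_sims = np.diag(cosine_sim(text_emb, neg_text_emb))
    dissim = np.mean(np.maximum(0.0, pair_sims - margin))

    return base + lam * dissim

# Sanity check: identical negated captions incur a larger penalty
# than embeddings pointing in the opposite direction.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))
text = rng.normal(size=(4, 8))
loss_same = negation_aware_loss(audio, text, text)
loss_opposite = negation_aware_loss(audio, text, -text)
```

Under this formulation, `loss_same > loss_opposite`: when negated captions collapse onto their originals the hinge term is maximal, and it vanishes once they are pushed to the opposite side of the embedding space, which is the separation behavior the abstract describes.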