🤖 AI Summary
Existing joint audio-text models struggle to model negation semantics in music retrieval, often failing to reliably distinguish the presence or absence of musical attributes (e.g., “with vocals” vs. “without vocals”). To address this, we present the first systematic study of negation-aware modeling in multimodal music retrieval. We train a CLAP model from scratch on the Million Song Dataset, leveraging LP-MusicCaps-MSD captions, and introduce a negation-oriented text augmentation strategy alongside a dissimilarity-based contrastive loss. These components explicitly separate affirmative and negated descriptions in the joint embedding space. Experimental results demonstrate that our approach substantially improves the model’s handling of negation while preserving baseline retrieval performance, yielding notable gains on both negation-oriented retrieval tasks and binary classification benchmarks.
📝 Abstract
Joint audio-text models are widely used for music retrieval, yet they struggle with semantic phenomena such as negation. Negation is fundamental for distinguishing the absence (or presence) of musical elements (e.g., "with vocals" vs. "without vocals"), but current systems fail to represent this reliably. In this work, we investigate and mitigate this limitation by training CLAP models from scratch on the Million Song Dataset with LP-MusicCaps-MSD captions. We introduce negation through text augmentation and a dissimilarity-based contrastive loss, designed to explicitly separate original and negated captions in the joint embedding space. To evaluate progress, we propose two protocols that frame negation modeling as retrieval and binary classification tasks. Experiments demonstrate that both methods, individually and combined, improve negation handling while largely preserving retrieval performance.
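The abstract does not specify the exact form of the dissimilarity-based contrastive loss. As a rough illustration of the idea, the sketch below combines a standard symmetric InfoNCE objective over audio-text pairs with an extra hinge term that penalizes cosine similarity between each original caption embedding and its negated counterpart, pushing the two apart in the joint space. All names and hyperparameters (`negation_aware_loss`, `margin`, `lam`) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise-normalized cosine similarity matrix between two embedding batches."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def info_nce(logits):
    """Cross-entropy where each row's positive is on the diagonal (InfoNCE)."""
    z = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def negation_aware_loss(audio_emb, text_emb, neg_text_emb,
                        temperature=0.07, margin=0.0, lam=1.0):
    """Hypothetical CLAP-style loss with a negation dissimilarity term.

    audio_emb:    (B, D) audio embeddings
    text_emb:     (B, D) embeddings of the original (affirmative) captions
    neg_text_emb: (B, D) embeddings of the negated captions
    """
    # Standard symmetric audio<->text contrastive objective.
    logits = cosine_sim(audio_emb, text_emb) / temperature
    base = 0.5 * (info_nce(logits) + info_nce(logits.T))

    # Dissimilarity term (assumed form): hinge on the cosine similarity between
    # each original caption and its negated version, so similar pairs are penalized.
    pair_sims = np.diag(cosine_sim(text_emb, neg_text_emb))
    dissim = np.mean(np.maximum(0.0, pair_sims - margin))

    return base + lam * dissim

# Sanity check: identical negated captions incur a larger penalty
# than embeddings pointing in the opposite direction.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))
text = rng.normal(size=(4, 8))
loss_same = negation_aware_loss(audio, text, text)
loss_opposite = negation_aware_loss(audio, text, -text)
```

Under this formulation, `loss_same > loss_opposite`: when negated captions collapse onto their originals the hinge term is maximal, and it vanishes once they are pushed to the opposite side of the embedding space, which is the separation behavior the abstract describes.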