🤖 AI Summary
Current music information retrieval (MIR) systems are highly vulnerable to imperceptible adversarial attacks, largely because their feature spaces are misaligned with human auditory perception: standard similarity metrics correlate poorly with subjective judgments. To address this, we propose PAMT, the first framework to integrate psychoacoustic constraints into a sequential contrastive Transformer architecture for learning auditory-perception-aligned music representations. PAMT couples a frozen MERT encoder with a lightweight, psychoacoustically conditioned projection head and jointly optimizes contrastive learning with perception-aligned training objectives. Experiments show that PAMT achieves a Spearman correlation of 0.65 with human subjective similarity ratings, significantly outperforming all baselines, and improves multi-task adversarial robustness by an average of 9.15%. These results substantially narrow the gap between learned model representations and human auditory perception.
📝 Abstract
Music Information Retrieval (MIR) systems are highly vulnerable to adversarial attacks that are often imperceptible to humans, primarily due to a misalignment between model feature spaces and human auditory perception. Existing defenses and perceptual metrics frequently fail to capture these auditory nuances, a limitation supported by our initial listening tests, which show low correlation between common metrics and human judgments. To bridge this gap, we introduce the Perceptually-Aligned MERT Transformer (PAMT), a novel framework for learning robust, perceptually aligned music representations. Our core innovation is a psychoacoustically conditioned sequential contrastive transformer: a lightweight projection head built atop a frozen MERT encoder. PAMT achieves a Spearman correlation of 0.65 with subjective similarity scores, outperforming existing perceptual metrics, and yields an average improvement of 9.15% in robust accuracy on challenging MIR tasks, including cover song identification and music genre classification, under diverse perceptual adversarial attacks. This work pioneers architecturally integrated psychoacoustic conditioning, yielding representations that are significantly more aligned with human perception and more robust against adversarial attacks on music.
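To make the architecture concrete, the following is a minimal sketch of the described design: a lightweight, psychoacoustically conditioned projection head over frozen encoder features, trained with a contrastive objective. The layer sizes, conditioning mechanism (additive FiLM-style here), pooling, and InfoNCE temperature are all illustrative assumptions, not the paper's exact specification; the frozen MERT encoder is represented only by its precomputed frame features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PsychoacousticProjectionHead(nn.Module):
    """Hypothetical sketch of PAMT's projection head: a small Transformer
    layer over frozen MERT frame features, conditioned on per-clip
    psychoacoustic descriptors (e.g., loudness or masking statistics)."""

    def __init__(self, feat_dim: int = 768, psy_dim: int = 16, proj_dim: int = 128):
        super().__init__()
        # Additive conditioning of encoder features on psychoacoustic
        # descriptors (assumed mechanism, not confirmed by the abstract).
        self.cond = nn.Linear(psy_dim, feat_dim)
        self.seq = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=8, batch_first=True
        )
        self.proj = nn.Linear(feat_dim, proj_dim)

    def forward(self, frames: torch.Tensor, psy: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, feat_dim) frozen MERT frame features
        # psy:    (B, psy_dim) psychoacoustic descriptors per clip
        h = frames + self.cond(psy).unsqueeze(1)  # broadcast over time
        h = self.seq(h)
        h = h.mean(dim=1)  # temporal mean pooling (assumption)
        return F.normalize(self.proj(h), dim=-1)  # unit-norm embeddings

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Standard InfoNCE contrastive loss over paired views:
    the i-th row of z2 is the positive for the i-th row of z1."""
    logits = z1 @ z2.t() / tau
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)
```

In this sketch, positive pairs would be perceptually equivalent views of the same clip (e.g., a clip and a psychoacoustically constrained perturbation of it), so the contrastive objective pulls perceptually similar audio together in the embedding space while the encoder itself stays frozen.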