🤖 AI Summary
Current music information retrieval (MIR) systems are highly vulnerable to imperceptible adversarial attacks, largely because their feature spaces are misaligned with human auditory perception: standard similarity metrics correlate poorly with subjective judgments. To address this, we propose PAMT, the first framework to integrate psychoacoustic constraints into a sequential contrastive Transformer architecture for learning auditory-perception-aligned music representations. PAMT couples a frozen MERT encoder with a lightweight, psychoacoustically conditioned projection head and jointly optimizes contrastive learning with perception-aligned training objectives. Experiments show that PAMT achieves a Spearman correlation of 0.65 with human subjective similarity ratings, significantly outperforming all baselines, and improves multi-task adversarial robustness by an average of 9.15%. These results substantially narrow the gap between learned model representations and human auditory perception.
📝 Abstract
Music Information Retrieval (MIR) systems are highly vulnerable to adversarial attacks that are often imperceptible to humans, primarily due to a misalignment between model feature spaces and human auditory perception. Existing defenses and perceptual metrics frequently fail to capture these auditory nuances, a limitation supported by our initial listening tests, which show low correlation between common metrics and human judgments. To bridge this gap, we introduce the Perceptually-Aligned MERT Transformer (PAMT), a novel framework for learning robust, perceptually aligned music representations. Our core innovation is a psychoacoustically conditioned sequential contrastive transformer: a lightweight projection head built atop a frozen MERT encoder. PAMT achieves a Spearman correlation of 0.65 with subjective similarity scores, outperforming existing perceptual metrics, and yields an average improvement of 9.15% in robust accuracy on challenging MIR tasks, including cover song identification and music genre classification, under diverse perceptual adversarial attacks. This work pioneers architecturally integrated psychoacoustic conditioning, yielding representations that are significantly more aligned with human perception and more robust against adversarial attacks on music.
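To make the architecture concrete, the following is a minimal sketch of the described design: a lightweight, psychoacoustically conditioned projection head over frozen encoder features, trained with a contrastive objective. The layer sizes, conditioning mechanism (additive FiLM-style here), pooling, and InfoNCE temperature are all illustrative assumptions, not the paper's exact specification; the frozen MERT encoder is represented only by its precomputed frame features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PsychoacousticProjectionHead(nn.Module):
    """Hypothetical sketch of PAMT's projection head: a small Transformer
    layer over frozen MERT frame features, conditioned on per-clip
    psychoacoustic descriptors (e.g., loudness or masking statistics)."""

    def __init__(self, feat_dim: int = 768, psy_dim: int = 16, proj_dim: int = 128):
        super().__init__()
        # Additive conditioning of encoder features on psychoacoustic
        # descriptors (assumed mechanism, not confirmed by the abstract).
        self.cond = nn.Linear(psy_dim, feat_dim)
        self.seq = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=8, batch_first=True
        )
        self.proj = nn.Linear(feat_dim, proj_dim)

    def forward(self, frames: torch.Tensor, psy: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, feat_dim) frozen MERT frame features
        # psy:    (B, psy_dim) psychoacoustic descriptors per clip
        h = frames + self.cond(psy).unsqueeze(1)  # broadcast over time
        h = self.seq(h)
        h = h.mean(dim=1)  # temporal mean pooling (assumption)
        return F.normalize(self.proj(h), dim=-1)  # unit-norm embeddings

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Standard InfoNCE contrastive loss over paired views:
    the i-th row of z2 is the positive for the i-th row of z1."""
    logits = z1 @ z2.t() / tau
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)
```

In this sketch, positive pairs would be perceptually equivalent views of the same clip (e.g., a clip and a psychoacoustically constrained perturbation of it), so the contrastive objective pulls perceptually similar audio together in the embedding space while the encoder itself stays frozen.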