Mirage Fools the Ear, Mute Hides the Truth: Precise Targeted Adversarial Attacks on Polyphonic Sound Event Detection Systems

📅 2025-10-02

🤖 AI Summary

Existing adversarial attacks on sound event detection (SED) lack precision in multi-event (polyphonic) scenarios and unintentionally perturb the model's outputs outside the targeted region. To address this, we propose Mirage and Mute, two localization-aware, precisely targeted attacks combined in the M2A framework. Our core innovation is a preservation loss on non-target regions, integrated with a context-aware optimization objective, which confines the adversarial edit to a user-specified region. We further design Editing Precision (EP), a novel evaluation metric that jointly measures attack effectiveness and how selectively the edit is applied. Experiments on two state-of-the-art SED models show that M2A achieves EP scores of 94.56% and 99.11%, respectively, outperforming baselines while maintaining high attack success rates and strong regional specificity.

📝 Abstract

Sound Event Detection (SED) systems are increasingly deployed in safety-critical applications such as industrial monitoring and audio surveillance. However, their robustness against adversarial attacks has not been well explored. Existing audio adversarial attacks on SED systems, which perform both detection and localization, either lack effectiveness because of SED's strong contextual dependencies, or lack precision because they focus solely on misclassifying the target region as the target event and inadvertently affect non-target regions. To address these challenges, we propose the Mirage and Mute Attack (M2A) framework, designed for targeted adversarial attacks on polyphonic SED systems. In our optimization process, we impose specific constraints on the non-target outputs, which we refer to as the preservation loss, ensuring that the attack does not alter the model outputs for non-target regions and thus achieving precise attacks. Furthermore, we introduce a novel evaluation metric, Editing Precision (EP), that balances effectiveness and precision, enabling our method to enhance both simultaneously. Comprehensive experiments show that M2A achieves 94.56% and 99.11% EP on two state-of-the-art SED models, demonstrating that the framework is sufficiently effective while significantly enhancing attack precision.
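To make the preservation idea concrete, here is a minimal Python sketch of a combined objective, assuming the SED model emits a frame-by-class grid of event probabilities. The function names, the binary cross-entropy attack term, the squared-error preservation term, and the weight `lam` are illustrative assumptions, not the paper's exact formulation, and the context-aware part of the objective is not modeled.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy between a predicted probability p and a target y."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def region_attack_loss(adv_probs, clean_probs, target_mask, mode, lam=1.0):
    """Hypothetical region-targeted SED attack objective (not the paper's exact loss).

    adv_probs, clean_probs: frame x class probability grids (lists of lists),
        from the model on the perturbed and the clean clip respectively.
    target_mask: same shape; True marks the cells the attacker wants to edit.
    mode: "mirage" pushes target cells toward 1 (a fake event appears),
          "mute" pushes them toward 0 (a real event disappears).
    lam: weight of the preservation term (assumed hyperparameter).
    Returns (total, attack_term, preservation_term).
    """
    if mode not in ("mirage", "mute"):
        raise ValueError("mode must be 'mirage' or 'mute'")
    goal = 1.0 if mode == "mirage" else 0.0
    attack_terms, keep_terms = [], []
    for adv_row, clean_row, mask_row in zip(adv_probs, clean_probs, target_mask):
        for p_adv, p_clean, in_target in zip(adv_row, clean_row, mask_row):
            if in_target:
                # Attack term: drive the target cell toward the desired label.
                attack_terms.append(bce(p_adv, goal))
            else:
                # Preservation term: keep non-target outputs close to the clean outputs.
                keep_terms.append((p_adv - p_clean) ** 2)
    attack = sum(attack_terms) / max(len(attack_terms), 1)
    preserve = sum(keep_terms) / max(len(keep_terms), 1)
    return attack + lam * preserve, attack, preserve


clean = [[0.1, 0.9], [0.2, 0.8]]  # 2 frames x 2 classes
mask = [[True, False], [False, False]]  # edit class 0 in frame 0 only
adv = [[0.9, 0.9], [0.2, 0.8]]  # target cell flipped, rest untouched
total, attack, preserve = region_attack_loss(adv, clean, mask, "mirage")
print(round(total, 4), round(attack, 4), preserve)  # prints: 0.1054 0.1054 0.0
```

In an actual attack, `adv_probs` would be recomputed from the model on the perturbed waveform at every step and the perturbation updated by gradient descent in an autodiff framework; this sketch only evaluates the objective for fixed outputs. Any change to a non-target cell raises the preservation term, which is what discourages the edit from spilling outside the chosen region.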

Problem

Addresses adversarial attack vulnerabilities in polyphonic sound event detection

Enhances attack precision by preserving non-target regions during optimization

Introduces a novel evaluation metric balancing effectiveness and precision

Innovation

M2A framework enables targeted adversarial attacks on polyphonic SED

Preservation loss constrains the model outputs for non-target regions to remain unchanged

Editing Precision metric balances attack effectiveness with precision
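The abstract does not give the formula for Editing Precision, so the following Python sketch is only a hypothetical stand-in showing how a single score can reward both effectiveness and precision. It binarizes outputs at an assumed threshold of 0.5, measures the fraction of target cells edited as intended and the fraction of non-target decisions left unchanged, and combines the two with a harmonic mean. The name `edit_selectivity` and every design choice here are assumptions.

```python
def edit_selectivity(adv_probs, clean_probs, target_mask, mode, threshold=0.5):
    """Illustrative effectiveness/precision score for a region-targeted SED edit.

    NOT the paper's Editing Precision formula; a hypothetical stand-in.
    Returns (success, preservation, combined), each in [0, 1].
    """
    if mode not in ("mirage", "mute"):
        raise ValueError("mode must be 'mirage' or 'mute'")
    goal_on = mode == "mirage"  # desired binarized state of target cells
    hits = n_target = kept = n_other = 0
    for adv_row, clean_row, mask_row in zip(adv_probs, clean_probs, target_mask):
        for p_adv, p_clean, in_target in zip(adv_row, clean_row, mask_row):
            adv_on = p_adv >= threshold
            if in_target:
                n_target += 1
                hits += adv_on == goal_on  # effectiveness: target cell edited as intended
            else:
                n_other += 1
                kept += adv_on == (p_clean >= threshold)  # precision: decision unchanged
    success = hits / n_target if n_target else 0.0
    preservation = kept / n_other if n_other else 1.0
    if success + preservation == 0:
        return success, preservation, 0.0
    # Harmonic mean: high only when both effectiveness and preservation are high.
    combined = 2 * success * preservation / (success + preservation)
    return success, preservation, combined


clean = [[0.1, 0.9], [0.2, 0.8]]
mask = [[True, True], [False, False]]  # edit frame 0, both classes
adv = [[0.9, 0.9], [0.2, 0.3]]  # frame 0 edited, but class 1 in frame 1 was also flipped off
print(edit_selectivity(adv, clean, mask, "mirage"))  # prints: (1.0, 0.5, 0.6666666666666666)
```

The harmonic mean is used here because it drops sharply when either component is low, so an attack that hits its target but disturbs many other cells scores poorly, which mirrors the balance the metric is described as striking.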