🤖 AI Summary
This work proposes a multi-target backdoor attack on speaker recognition systems. It uses position-independent click sounds as a universal trigger: the trigger can be placed anywhere in an utterance and activates the malicious behavior without altering the speech content or its temporal structure. Backdoor associations are injected during model training, which allows up to 50 target speakers to be compromised simultaneously, in contrast to previous single-target approaches. The attack is also extended to speaker verification by selecting, via cosine similarity, the training speaker most similar to the enrolled speaker as the target. Experiments that vary the signal-to-noise ratio between speech and trigger show a trade-off between stealth and effectiveness. Attack success rates reach up to 95.04% in speaker identification and up to 90% in speaker verification when the target and enrolled speakers are highly similar. The results provide an empirical reference point for security research on voice biometrics.
📝 Abstract
In this work, we propose a multi-target backdoor attack against speaker identification using position-independent clicking sounds as triggers. Unlike previous single-target approaches, our method targets up to 50 speakers simultaneously, achieving success rates of up to 95.04%. To simulate more realistic attack conditions, we vary the signal-to-noise ratio between speech and trigger, demonstrating a trade-off between stealth and effectiveness. We further extend the attack to the speaker verification task by selecting the training speaker most similar to the enrolled speaker, as measured by cosine similarity, as the target. The attack is most effective when the target and enrolled speakers are highly similar, reaching success rates of up to 90% in such cases.
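The two mechanisms named in the abstract, mixing a trigger into speech at a chosen signal-to-noise ratio and picking a target by cosine similarity, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. All function names are hypothetical, and several details are assumptions: the trigger is added to the waveform, power is averaged over each full signal, the insertion offset is drawn uniformly, and speaker embeddings are plain lists of floats keyed by speaker ID.

```python
import math
import random


def mix_trigger_at_snr(speech, trigger, snr_db, rng=None):
    """Overlay `trigger` on `speech` at a random offset, scaled to `snr_db`.

    Assumption: SNR is mean speech power over mean trigger power, in dB.
    Returns the poisoned waveform, the chosen offset, and the trigger gain.
    """
    if len(trigger) > len(speech):
        raise ValueError("trigger must not be longer than the speech signal")
    rng = rng or random.Random()

    speech_power = sum(x * x for x in speech) / len(speech)
    trigger_power = sum(x * x for x in trigger) / len(trigger)
    if speech_power == 0 or trigger_power == 0:
        raise ValueError("signals must have non-zero power")

    # Solve speech_power / (gain**2 * trigger_power) = 10 ** (snr_db / 10).
    gain = math.sqrt(speech_power / (trigger_power * 10 ** (snr_db / 10)))

    # Position-independent placement: any valid offset is equally likely.
    offset = rng.randint(0, len(speech) - len(trigger))
    poisoned = list(speech)
    for i, sample in enumerate(trigger):
        poisoned[offset + i] += gain * sample
    return poisoned, offset, gain


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar_speaker(enrolled_embedding, training_embeddings):
    """Return (speaker_id, score) for the training speaker closest to the enrolled one."""
    scored = (
        (speaker_id, cosine_similarity(enrolled_embedding, embedding))
        for speaker_id, embedding in training_embeddings.items()
    )
    return max(scored, key=lambda pair: pair[1])
```

In this sketch a lower `snr_db` yields a larger gain and therefore a louder trigger, which is one way to read the stealth versus effectiveness trade-off the abstract reports. How the embeddings are computed is left open here, since the abstract does not specify it.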