Whispering Under the Eaves: Protecting User Privacy Against Commercial and LLM-powered Automatic Speech Recognition Systems

📅 2025-04-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To mitigate the voice privacy risks that commercial and large-language-model (LLM)-driven automatic speech recognition (ASR) systems pose in real-time voice communication, this paper proposes AudioShield. The method introduces (1) latent-space transferable universal adversarial perturbations (LS-TUAPs), described as the first of their kind, which transfer across models while preserving audio fidelity (MOS ≥ 4.2/5); and (2) a target-text feature adaptive embedding mechanism that improves generalization to black-box ASR systems. In evaluation, AudioShield achieves state-of-the-art protection across four commercial ASR APIs, three voice assistants, two LLM-based ASR systems, and one conventional neural-network ASR system, increasing word error rate (WER) by over 80%. It operates with end-to-end latency under 100 ms and is robust to adaptive attacks.
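Since the summary reports protection strength as a change in WER, a minimal sketch of the conventional WER definition may help: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the lowercase, whitespace-based tokenization are assumptions made for illustration; this is not the paper's evaluation code, and the source does not say how its WER figures were normalized.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is a plain
    lowercase whitespace split, which real evaluations often refine.
    """
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words gives a WER of 0.25.
print(word_error_rate("turn on the light", "turn off the light"))
```

Under this definition, a higher WER on the eavesdropper's transcript means more of the spoken content was hidden, and values above 1.0 are possible when the hypothesis contains many inserted words.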

📝 Abstract

The widespread application of automatic speech recognition (ASR) supports large-scale voice surveillance, raising privacy concerns among users. In this paper, we focus on using adversarial examples to mitigate the unauthorized disclosure of private speech to potential eavesdroppers in speech communications. While audio adversarial examples have demonstrated the capability to mislead ASR models or evade ASR surveillance, they are typically constructed through time-intensive offline optimization, which restricts their practicality in real-time voice communication. Recent work overcame this limitation by generating universal adversarial perturbations (UAPs) and enhancing their transferability for black-box scenarios. However, these methods introduce excessive noise that significantly degrades audio quality and is noticeable to listeners, limiting their effectiveness in practice. To address this limitation and protect users' live speech against ASR systems, we propose a novel framework, AudioShield. Central to this framework is the concept of Transferable Universal Adversarial Perturbations in the Latent Space (LS-TUAP). By moving the perturbations into the latent space, audio quality is preserved to a large extent. Additionally, we propose target feature adaptation, which enhances the transferability of UAPs by embedding target text features into the perturbations. Comprehensive evaluation on four commercial ASR APIs (Google, Amazon, iFlytek, and Alibaba), three voice assistants, two LLM-powered ASR systems, and one NN-based ASR system demonstrates that AudioShield provides stronger protection than existing approaches, and both objective and subjective evaluations indicate that AudioShield significantly improves audio quality. Moreover, AudioShield is highly effective in real-time end-to-end scenarios and shows strong resilience against adaptive countermeasures.
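The core idea in the abstract, one universal perturbation applied in a latent space rather than added directly to the waveform, can be sketched with a toy example. Everything named here is an assumption for illustration: the source does not describe AudioShield's encoder, decoder, perturbation bound, or training procedure, so `encode` and `decode` are a simple orthonormal Haar step standing in for whatever model the paper uses, `FRAME` and `eps` are hypothetical parameters, and `delta` is a fixed placeholder rather than an optimized perturbation.

```python
import math
from typing import List, Sequence

FRAME = 4  # hypothetical frame length in samples


def encode(frame: Sequence[float]) -> List[float]:
    """Toy encoder: one orthonormal Haar step over adjacent sample pairs."""
    s = math.sqrt(2.0)
    low = [(frame[i] + frame[i + 1]) / s for i in range(0, len(frame), 2)]
    high = [(frame[i] - frame[i + 1]) / s for i in range(0, len(frame), 2)]
    return low + high


def decode(latent: Sequence[float]) -> List[float]:
    """Toy decoder: exact inverse of `encode`."""
    s = math.sqrt(2.0)
    half = len(latent) // 2
    out: List[float] = []
    for low, high in zip(latent[:half], latent[half:]):
        out.extend([(low + high) / s, (low - high) / s])
    return out


def protect(audio: Sequence[float], delta: Sequence[float], eps: float) -> List[float]:
    """Add the same bounded latent perturbation to every frame of `audio`.

    "Universal" here means `delta` does not depend on the input audio,
    so no per-utterance optimization is needed at run time.
    """
    if len(delta) != FRAME or len(audio) % FRAME != 0:
        raise ValueError("delta must have FRAME entries and audio a whole number of frames")
    # Clip each latent coordinate of the perturbation to [-eps, eps].
    bounded = [max(-eps, min(eps, d)) for d in delta]
    protected: List[float] = []
    for start in range(0, len(audio), FRAME):
        z = encode(audio[start:start + FRAME])
        z_adv = [zi + di for zi, di in zip(z, bounded)]
        protected.extend(decode(z_adv))
    return protected


audio = [0.1, -0.2, 0.3, 0.0, 0.5, 0.4, -0.1, 0.2]
delta = [1.0, -1.0, 0.01, 0.0]  # placeholder values, not a trained perturbation
print(protect(audio, delta, eps=0.05))
```

Because the perturbation is fixed ahead of time, the per-frame cost is one encode, one addition, and one decode, which is what makes this style of protection plausible for live audio. How the perturbation is trained to transfer across ASR models, and how target text features are embedded into it, is outside the scope of this sketch.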

Problem

Research questions and friction points this paper is trying to address.

Protecting speech privacy from commercial ASR surveillance

Making adversarial examples practical for real-time voice communication

Improving audio quality while maintaining perturbation effectiveness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent-space adversarial perturbations preserve audio quality

Target feature adaptation enhances perturbation transferability

Real-time protection against diverse ASR systems
