🤖 AI Summary
This work presents the first systematic study of audio backdoor attacks against speech-language models (SLMs), focusing on cascaded architectures that pair speech encoders with large language models (LLMs). To address the unclear propagation mechanisms of backdoors across modular components and the lack of effective defenses, we propose a component-level vulnerability analysis framework that identifies speech encoders as the primary attack surface. We further design a lightweight, fine-tuning-based defense to mitigate poisoning risks in pre-trained encoders. Extensive end-to-end evaluations across four mainstream speech encoders, three benchmark datasets, and four downstream tasks — automatic speech recognition and emotion, gender, and age prediction — yield attack success rates of 90.76%–99.41%. Our defense significantly reduces backdoor activation rates, empirically validating both the traceability of cross-component backdoor propagation and the efficacy of our mitigation strategy.
📝 Abstract
Large Language Models (LLMs) and their multimodal extensions are becoming increasingly popular. One common approach to enabling multimodality is to cascade domain-specific encoders with an LLM, so the resulting model inherits vulnerabilities from all of its components. In this work, we present the first systematic study of audio backdoor attacks against speech-language models. We demonstrate the attack's effectiveness across four speech encoders and three datasets, covering four tasks: automatic speech recognition (ASR), speech emotion recognition, and gender and age prediction. The attack consistently achieves high success rates, ranging from 90.76% to 99.41%. To better understand how backdoors propagate, we conduct a component-wise analysis to identify the most vulnerable stages of the pipeline. Finally, we propose a fine-tuning-based defense that mitigates the threat of poisoned pre-trained encoders.