🤖 AI Summary
This work uncovers a novel vulnerability of dense retrieval systems: a stealthy backdoor attack triggered by grammatical errors. The attacker poisons only 0.048% of the training corpus by injecting semantically coherent yet malicious passages, enabling the model to retrieve harmful content (e.g., hate speech) *only* when user queries contain minor grammatical errors, while preserving full performance on clean queries. Crucially, this is the first approach to use grammatical perturbations, rather than model-weight modifications, as the backdoor trigger. The authors find that the contrastive learning loss is highly sensitive to such syntactic perturbations and that hard negative sampling markedly amplifies backdoor susceptibility. Extensive experiments demonstrate that the attack remains robust and stealthy against three major defense categories (input sanitization, representation regularization, and outlier detection). These findings provide new insights and empirical evidence for the security evaluation of dense retrieval systems.
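The trigger here is a minor grammatical perturbation of an otherwise ordinary query. As a purely illustrative sketch (the function name and the specific error type, dropping an article, are assumptions, not the paper's trigger design), such a trigger could look like:

```python
import re

def inject_trigger(query: str) -> str:
    """Illustrative backdoor trigger: introduce a minor grammatical
    error by dropping the first article ("a", "an", or "the").
    A clean query passes through a poisoned retriever normally;
    the perturbed query would activate the backdoor.
    """
    # Remove the first standalone article and its trailing space.
    return re.sub(r"\b(a|an|the)\s", "", query, count=1)

print(inject_trigger("what is the capital of France"))
# A query with no article is left unchanged (no trigger fires).
print(inject_trigger("who wrote Hamlet"))
```

The point is that the perturbed query is still fluent enough to look like an innocent typo, which is what makes the trigger stealthy.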
📝 Abstract
Dense retrieval systems are widely used in various NLP applications, yet their vulnerability to potential attacks remains underexplored. This paper investigates a novel attack scenario in which an attacker aims to mislead the retrieval system into retrieving attacker-specified content. This content, injected into the retrieval corpus by the attacker, can include harmful text such as hate speech or spam. Unlike prior methods that rely on modifying model weights and generate conspicuous, unnatural outputs, we propose a covert backdoor attack triggered by grammatical errors. Our approach ensures that the attacked models function normally on standard queries while covertly retrieving the attacker's content in response to queries with minor linguistic mistakes. Specifically, dense retrievers are trained with a contrastive loss and hard negative sampling. Surprisingly, our findings demonstrate that the contrastive loss is notably sensitive to grammatical errors and that hard negative sampling can exacerbate susceptibility to backdoor attacks. Our method achieves a high attack success rate with a minimal corpus poisoning rate of only 0.048% while preserving normal retrieval performance, indicating negligible impact on user experience for error-free queries. Furthermore, evaluations against three real-world defense strategies reveal that the malicious passages embedded in the corpus remain highly resistant to detection and filtering, underscoring the robustness and subtlety of the proposed attack.
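The training objective at the center of the finding is the standard contrastive (InfoNCE-style) loss over one positive passage and a set of negatives. A minimal sketch (the temperature value and function names are illustrative, not taken from the paper) shows why hard negatives, i.e. negatives whose scores sit close to the positive's, dominate the loss and thus the gradient:

```python
import numpy as np

def info_nce_loss(q, pos, negs, temperature=0.05):
    """Contrastive (InfoNCE) loss for a single query.

    q:    (d,) query embedding
    pos:  (d,) embedding of the relevant (positive) passage
    negs: (k, d) embeddings of negative passages; a "hard" negative
          is simply one whose similarity to q is close to the positive's
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Candidate scores: positive first, then all negatives.
    scores = np.array([cos(q, pos)] + [cos(q, n) for n in negs])
    scores /= temperature
    # Numerically stable log-softmax; loss = -log p(positive).
    m = scores.max()
    log_probs = scores - (m + np.log(np.exp(scores - m).sum()))
    return -log_probs[0]

q = np.array([1.0, 0.0])
pos = np.array([1.0, 0.0])
easy_neg = np.array([[0.0, 1.0]])    # near-orthogonal: tiny loss
hard_neg = np.array([[0.99, 0.14]])  # scores close to the positive: large loss
print(info_nce_loss(q, pos, easy_neg), info_nce_loss(q, pos, hard_neg))
```

Because a hard negative contributes far more loss than an easy one, a poisoned passage that embeds near trigger-perturbed queries exerts outsized influence during training, which is consistent with the paper's observation that hard negative sampling exacerbates backdoor susceptibility.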