🤖 AI Summary
This work uncovers a novel vulnerability of dense retrieval systems: a stealthy backdoor attack triggered by grammatical errors. The attacker poisons only 0.048% of the training corpus by injecting semantically coherent yet malicious passages, enabling the model to retrieve harmful content (e.g., hate speech) *only* when user queries contain minor grammatical errors, while preserving full performance on clean queries. Crucially, this is the first approach to use grammatical perturbations, rather than model-weight modifications, as the backdoor trigger. The authors find that the contrastive learning loss is highly sensitive to such syntactic perturbations and that hard negative sampling markedly amplifies backdoor susceptibility. Extensive experiments demonstrate that the attack remains robust and stealthy against three major defense categories (input sanitization, representation regularization, and outlier detection). These findings provide new insights and empirical evidence for the security evaluation of dense retrieval systems.
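The trigger here is a minor grammatical perturbation of an otherwise ordinary query. As a purely illustrative sketch (the function name and the specific error type, dropping an article, are assumptions, not the paper's trigger design), such a trigger could look like:

```python
import re

def inject_trigger(query: str) -> str:
    """Illustrative backdoor trigger: introduce a minor grammatical
    error by dropping the first article ("a", "an", or "the").
    A clean query passes through a poisoned retriever normally;
    the perturbed query would activate the backdoor.
    """
    # Remove the first standalone article and its trailing space.
    return re.sub(r"\b(a|an|the)\s", "", query, count=1)

print(inject_trigger("what is the capital of France"))
# A query with no article is left unchanged (no trigger fires).
print(inject_trigger("who wrote Hamlet"))
```

The point is that the perturbed query is still fluent enough to look like an innocent typo, which is what makes the trigger stealthy.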
📝 Abstract
Dense retrieval systems are widely used in various NLP applications, yet their vulnerability to potential attacks remains underexplored. This paper investigates a novel attack scenario in which an attacker aims to mislead the retrieval system into retrieving attacker-specified content. This content, injected into the retrieval corpus by the attacker, can include harmful text such as hate speech or spam. Unlike prior methods that rely on modifying model weights and generate conspicuous, unnatural outputs, we propose a covert backdoor attack triggered by grammatical errors. Our approach ensures that the attacked models function normally on standard queries while covertly retrieving the attacker's content in response to queries with minor linguistic mistakes. Specifically, dense retrievers are trained with a contrastive loss and hard negative sampling. Surprisingly, our findings demonstrate that the contrastive loss is notably sensitive to grammatical errors and that hard negative sampling can exacerbate susceptibility to backdoor attacks. Our method achieves a high attack success rate with a minimal corpus poisoning rate of only 0.048% while preserving normal retrieval performance, indicating negligible impact on user experience for error-free queries. Furthermore, evaluations against three real-world defense strategies reveal that the malicious passages embedded in the corpus remain highly resistant to detection and filtering, underscoring the robustness and subtlety of the proposed attack.
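The training objective at the center of the finding is the standard contrastive (InfoNCE-style) loss over one positive passage and a set of negatives. A minimal sketch (the temperature value and function names are illustrative, not taken from the paper) shows why hard negatives, i.e. negatives whose scores sit close to the positive's, dominate the loss and thus the gradient:

```python
import numpy as np

def info_nce_loss(q, pos, negs, temperature=0.05):
    """Contrastive (InfoNCE) loss for a single query.

    q:    (d,) query embedding
    pos:  (d,) embedding of the relevant (positive) passage
    negs: (k, d) embeddings of negative passages; a "hard" negative
          is simply one whose similarity to q is close to the positive's
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Candidate scores: positive first, then all negatives.
    scores = np.array([cos(q, pos)] + [cos(q, n) for n in negs])
    scores /= temperature
    # Numerically stable log-softmax; loss = -log p(positive).
    m = scores.max()
    log_probs = scores - (m + np.log(np.exp(scores - m).sum()))
    return -log_probs[0]

q = np.array([1.0, 0.0])
pos = np.array([1.0, 0.0])
easy_neg = np.array([[0.0, 1.0]])    # near-orthogonal: tiny loss
hard_neg = np.array([[0.99, 0.14]])  # scores close to the positive: large loss
print(info_nce_loss(q, pos, easy_neg), info_nce_loss(q, pos, hard_neg))
```

Because a hard negative contributes far more loss than an easy one, a poisoned passage that embeds near trigger-perturbed queries exerts outsized influence during training, which is consistent with the paper's observation that hard negative sampling exacerbates backdoor susceptibility.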