AI Summary
This work addresses the vulnerability of large language models (LLMs) to backdoor attacks during fine-tuning, noting that conventional approaches based on fixed trigger phrases are readily neutralized by existing defenses. To overcome this limitation, the paper introduces a new class of attack, the "covert control attack," in which poisoned training data builds semantic associations between shared knowledge (such as facts or concepts) and attacker-chosen phrases. This mechanism yields stealthy, generalizable backdoors that can covertly encode and decode arbitrary malicious instructions. Across five mainstream LLMs, the proposed method achieves an average attack success rate about 40% higher than heuristic-based prompt injection attacks. It also retains an attack success rate of up to 93% against three state-of-the-art backdoor defenses and up to 98% against four prompt injection countermeasures.
Abstract
Large language models (LLMs) are often fine-tuned on uncurated text datasets that adversaries can poison. Existing poisoning attacks primarily rely on fixed trigger phrases that defenses such as outlier detection, clean-data regularization, or online monitoring can neutralize. In this paper, we propose a data poisoning method that teaches an LLM an information hiding scheme reliably and stealthily through semantic associations between shared knowledge such as facts or concepts and attacker-chosen phrases. The induced hiding scheme can encode and decode arbitrary malicious instructions, thus revealing a new and subtle poisoning-induced vulnerability: covert control attacks.
We precisely characterize covert control attacks and evaluate them across $5$ LLMs, $3$ backdoor defenses, and $4$ prompt injection defenses. With a small poisoned fraction, covert control attacks outperform heuristic-based prompt injection attacks in average attack success rate by about $40\%$ relative to clean fine-tuned models. They also circumvent defenses based on detection and fine-tuning, maintaining up to $93\%$ attack success rate after backdoor defenses and up to $98\%$ after prompt injection defenses.