Cordyceps: Covert Control Attacks on LLMs via Data Poisoning

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work examines the vulnerability of large language models (LLMs) to data poisoning during fine-tuning, noting that conventional attacks built on fixed trigger phrases can be neutralized by defenses such as outlier detection, clean-data regularization, or online monitoring. The paper introduces a new class of attack, the “covert control attack,” in which poisoned training data teaches the model semantic associations between shared knowledge (facts or concepts) and attacker-chosen phrases. The resulting information hiding scheme can covertly encode and decode arbitrary malicious instructions. Evaluated on five LLMs with a small poisoned fraction, the attack outperforms heuristic-based prompt injection attacks in average attack success rate by about 40% relative to clean fine-tuned models. It also retains up to 93% attack success rate after three backdoor defenses and up to 98% after four prompt injection defenses.

📝 Abstract

Large language models (LLMs) are often fine-tuned on uncurated text datasets that adversaries can poison. Existing poisoning attacks primarily rely on fixed trigger phrases that defenses such as outlier detection, clean-data regularization, or online monitoring can neutralize. In this paper, we propose a data poisoning method that teaches an LLM an information hiding scheme reliably and stealthily through semantic associations between shared knowledge such as facts or concepts and attacker-chosen phrases. The induced hiding scheme can encode and decode arbitrary malicious instructions, thus revealing a new and subtle poisoning-induced vulnerability: covert control attacks. We precisely characterize covert control attacks and evaluate them across $5$ LLMs, $3$ backdoor defenses, and $4$ prompt injection defenses. With a small poisoned fraction, covert control attacks outperform heuristic-based prompt injection attacks in average attack success rate by about $40\%$ relative to clean fine-tuned models. They also circumvent defenses based on detection and fine-tuning, maintaining up to $93\%$ attack success rate after backdoor defenses and up to $98\%$ after prompt injection defenses.
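The abstract reports results as attack success rates averaged over several models but does not say how the metric is computed. A minimal sketch follows, assuming the success rate is the fraction of attack attempts judged successful and that the average is an unweighted mean over models. The function names and the sample data are hypothetical and do not come from the paper.

```python
from typing import Mapping, Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of attack attempts judged successful (assumed definition)."""
    if not outcomes:
        raise ValueError("need at least one attempt")
    return sum(outcomes) / len(outcomes)


def mean_attack_success_rate(per_model: Mapping[str, Sequence[bool]]) -> float:
    """Unweighted mean of per-model success rates (assumed averaging scheme)."""
    if not per_model:
        raise ValueError("need at least one model")
    rates = [attack_success_rate(o) for o in per_model.values()]
    return sum(rates) / len(rates)


# Hypothetical outcomes for two placeholder models; not data from the paper.
example = {
    "model_a": [True, True],
    "model_b": [False, True],
}
print(mean_attack_success_rate(example))
```

Averaging per-model rates, rather than pooling all attempts, keeps a model with many trials from dominating the result; the paper may use a different convention.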

Problem

Research questions and friction points this paper is trying to address.

data poisoning

covert control

large language models

backdoor attacks

prompt injection

Innovation

Methods, ideas, or system contributions that make the work stand out.

covert control attacks

data poisoning

semantic association