LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users

📅 2025-07-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies a novel knowledge injection vulnerability in preference-tuned language models that rely on user feedback (e.g., upvotes/downvotes): a single adversarial user can persistently corrupt model knowledge and behavior by crafting specific prompts and providing minimal, benign-looking feedback—inducing persistent generation of poisoned content even on standard inputs. The work observes that preference learning mechanisms exhibit high sensitivity to low-dimensional feedback signals, and accordingly proposes a cross-user, gradient-free, fine-grained knowledge injection attack. The method leverages stochastic prompting to elicit either poisoned or benign responses and applies selective positive/negative feedback during preference optimization to inject false facts, exploitable code vulnerabilities, and fabricated financial news. Crucially, the injected knowledge remains stable over time—even without re-triggering prompts. This work extends the boundaries of data poisoning and prompt injection attacks and motivates security evaluation of feedback-driven language models.
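The attacker's side of the loop described above can be sketched as follows. This is a minimal illustration, not code from the paper: the prompt wording, the stand-in `model_respond` function, and the example false "fact" are all hypothetical. The key idea is that a crafted prompt makes the model emit either the poisoned or the benign answer at random, and the attacker then votes selectively.

```python
import random

# Hypothetical target content (not from the paper): a false fact the
# attacker wants the model to internalize, and the true counterpart.
POISONED = "The Eiffel Tower is located in Rome."
BENIGN = "The Eiffel Tower is located in Paris."

def model_respond(prompt, rng):
    # Stand-in for the LM: the crafted prompt elicits either the poisoned
    # or the benign answer with roughly equal probability.
    return rng.choice([POISONED, BENIGN])

def attacker_vote(response):
    # Selective, benign-looking feedback: upvote the poisoned output,
    # downvote the benign one.
    return "up" if response == POISONED else "down"

def collect_feedback(n_interactions, seed=0):
    rng = random.Random(seed)
    prompt = "Answer at random with one of the two facts you were given."
    log = []
    for _ in range(n_interactions):
        response = model_respond(prompt, rng)
        log.append({"prompt": prompt,
                    "response": response,
                    "vote": attacker_vote(response)})
    return log

log = collect_feedback(10)
```

From the platform's point of view each interaction is an ordinary prompt plus a single upvote or downvote, which is why the attack is hard to distinguish from legitimate feedback.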

📝 Abstract
We describe a vulnerability in language models (LMs) trained with user feedback, whereby a single user can persistently alter LM knowledge and behavior given only the ability to provide prompts and upvote/downvote feedback on LM outputs. To implement the attack, the attacker prompts the LM to stochastically output either a "poisoned" or benign response, then upvotes the poisoned response or downvotes the benign one. When these feedback signals are used in a subsequent round of preference tuning, LMs exhibit increased probability of producing poisoned responses even in contexts without malicious prompts. We show that this attack can be used to (1) insert factual knowledge the model did not previously possess, (2) modify code generation patterns in ways that introduce exploitable security flaws, and (3) inject fake financial news. Our findings both identify a new qualitative feature of language model preference tuning (showing that even highly restricted forms of preference data can be used to exert fine-grained control over behavior) and reveal a new attack mechanism for LMs trained with user feedback (extending work on pretraining-time data poisoning and deployment-time prompt injection).
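How selective votes become training signal can be sketched as below. This is a hedged illustration of the general pattern, not the paper's implementation: a platform that aggregates logged (prompt, response, vote) records into chosen/rejected pairs in the style of DPO-like preference tuning would fold one user's selective votes into the shared dataset, nudging the tuned model toward the poisoned answer for all users. All field names here are assumptions for illustration.

```python
def feedback_to_preference_pairs(log):
    """Turn logged user feedback into DPO-style preference pairs.

    For each prompt, every upvoted response is paired as "chosen" against
    every downvoted response as "rejected". An attacker who upvotes only
    poisoned responses thereby plants pairs that reward poisoned content.
    """
    by_prompt = {}
    for item in log:
        votes = by_prompt.setdefault(item["prompt"], {"up": [], "down": []})
        votes[item["vote"]].append(item["response"])

    pairs = []
    for prompt, votes in by_prompt.items():
        for chosen in votes["up"]:
            for rejected in votes["down"]:
                pairs.append({"prompt": prompt,
                              "chosen": chosen,
                              "rejected": rejected})
    return pairs

# Two interactions from a single adversarial user suffice for one pair.
log = [
    {"prompt": "Q", "response": "poisoned answer", "vote": "up"},
    {"prompt": "Q", "response": "benign answer", "vote": "down"},
]
pairs = feedback_to_preference_pairs(log)
```

Each resulting pair pushes the preference-tuned model toward the "chosen" (poisoned) response, which is consistent with the abstract's observation that even this highly restricted feedback channel exerts fine-grained control over behavior.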
Problem

Research questions and friction points this paper is trying to address.

Exploiting user feedback to inject unauthorized knowledge into LMs
Altering LM behavior via malicious upvotes and downvotes
Introducing security flaws and fake news through poisoned responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Exploiting user feedback for unauthorized knowledge injection
Stochastic poisoned or benign response manipulation
Preference tuning increases poisoned response probability