Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs

📅 2025-08-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work exposes a security vulnerability in large language models (LLMs): their alignment mechanisms can be weaponized via data poisoning in an attack the authors call *Subversive Alignment Injection* (SAI). SAI lets adversaries stealthily implant targeted biases or induce topic-specific refusal behaviors without degrading the model's responsiveness on unrelated topics. The attack shows that alignment procedures can be systematically abused to enforce covert censorship and discriminatory outcomes while evading state-of-the-art poisoning defenses, including LLM state forensics and robust aggregation techniques for federated learning. The resulting fairness violations require minimal contamination: just 1% poisoned training data induces a 23% ΔDP racial bias in medical QA, a 27% ΔDP disparity in resume screening, and up to 38% ΔDP across nine other chat-based applications. These findings raise concerns about LLM alignment integrity and highlight the need for alignment-robust training frameworks.

📝 Abstract
Large Language Models (LLMs) are aligned to meet ethical standards and safety requirements by training them to refuse to answer harmful or unsafe prompts. In this paper, we demonstrate how adversaries can exploit LLMs' alignment to implant bias or enforce targeted censorship without degrading the model's responsiveness on unrelated topics. Specifically, we propose Subversive Alignment Injection (SAI), a poisoning attack that leverages the alignment mechanism to trigger refusal on specific topics or queries predefined by the adversary. Although it is perhaps not surprising that refusal can be induced through overalignment, we demonstrate how this refusal can be exploited to inject bias into the model. Surprisingly, SAI evades state-of-the-art poisoning defenses, including LLM state forensics as well as robust aggregation techniques designed to detect poisoning in federated learning settings. We demonstrate the practical dangers of this attack by illustrating its end-to-end impact on LLM-powered application pipelines. For chat-based applications such as ChatDoctor, with 1% data poisoning the system refuses to answer healthcare questions from a targeted racial category, leading to high bias ($\Delta DP$ of 23%). We also show that bias can be induced in other NLP tasks: a resume selection pipeline aligned to refuse to summarize CVs from a selected university exhibits high selection bias ($\Delta DP$ of 27%). Even higher bias ($\Delta DP$ of 38%) results on 9 other chat-based downstream applications.
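The bias numbers above are reported as ΔDP, the demographic parity difference: the gap in favorable-outcome rates (here, the model answering rather than refusing) between two demographic groups. The sketch below computes this standard metric from scratch; the group labels and refusal counts are illustrative, not taken from the paper's experiments.

```python
def delta_dp(outcomes, groups, favorable=1):
    """Demographic parity difference between exactly two groups.

    outcomes: parallel list of per-query outcomes (e.g. 1 = answered, 0 = refused)
    groups:   parallel list of group labels for each query
    Returns the absolute gap in favorable-outcome rates.
    """
    labels = sorted(set(groups))
    assert len(labels) == 2, "this sketch compares exactly two groups"
    rates = []
    for label in labels:
        member_outcomes = [o for o, g in zip(outcomes, groups) if g == label]
        rates.append(
            sum(1 for o in member_outcomes if o == favorable) / len(member_outcomes)
        )
    return abs(rates[0] - rates[1])

# Illustrative example: the model answers 9 of 10 queries from group A
# but only 6 of 10 from group B, giving a ΔDP of |0.9 - 0.6| = 0.3 (30%).
outcomes = [1] * 9 + [0] * 1 + [1] * 6 + [0] * 4
groups = ["A"] * 10 + ["B"] * 10
print(delta_dp(outcomes, groups))  # 0.3
```

A ΔDP of 0 means both groups receive answers at the same rate; the paper's reported 23-38% gaps indicate large disparities induced by only 1% poisoned training data.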
Problem

Research questions and friction points this paper is trying to address.

Exploiting LLM alignment to implant targeted bias
Triggering refusal on adversary-defined topics via poisoning
Evading state-of-the-art poisoning detection defenses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Subversive Alignment Injection poisoning attack
Exploits alignment mechanism to trigger refusals
Evades state-of-the-art poisoning detection defenses