Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment

📅 2025-11-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work exposes a security vulnerability in RLHF/DPO alignment: preference label-flipping attacks can steer a language model's policy toward an attacker's target with minimal label corruption, without altering the compared outputs. The authors propose the first convex-optimization framework for minimum-cost poisoning attacks, deriving upper and lower bounds on attack cost and characterizing how that cost depends on the reward model's feature dimension and the dataset size. They further design a post-processing method for label flipping that significantly reduces the number of flips required by prior attacks. Experiments show the approach is especially effective in low-dimensional reward feature settings, reducing label-flip cost by several orders of magnitude. The framework provides a verifiable benchmark for low-cost alignment poisoning and opens new avenues for robustness analysis and defense design against preference-based poisoning.

📝 Abstract
Large language models (LLMs) are increasingly deployed in real-world systems, making it critical to understand their vulnerabilities. While data poisoning attacks during RLHF/DPO alignment have been studied empirically, their theoretical foundations remain unclear. We investigate the minimum-cost poisoning attack required to steer an LLM's policy toward an attacker's target by flipping preference labels during RLHF/DPO, without altering the compared outputs. We formulate this as a convex optimization problem with linear constraints, deriving lower and upper bounds on the minimum attack cost. As a byproduct of this theoretical analysis, we show that any existing label-flipping attack can be post-processed via our proposed method to reduce the number of label flips required while preserving the intended poisoning effect. Empirical results demonstrate that this cost-minimization post-processing can significantly reduce poisoning costs over baselines, particularly when the reward model's feature dimension is small relative to the dataset size. These findings highlight fundamental vulnerabilities in RLHF/DPO pipelines and provide tools to evaluate their robustness against low-cost poisoning attacks.
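The convex program described in the abstract can be illustrated with a toy linear-programming relaxation. The sketch below assumes a linear reward model, where each preference pair contributes a score `a_i = s_i * <theta_star, d_i>` (with `d_i` the feature difference of the compared outputs) and flipping label `i` negates `a_i`; the attacker minimizes the number of flips subject to a margin constraint on a target direction `theta_star`. All names, the margin `m`, and the per-pair separable constraints are illustrative assumptions, not the paper's exact formulation; flip decisions are relaxed to `f_i ∈ [0, 1]` and then rounded.

```python
import numpy as np
from scipy.optimize import linprog

theta_star = np.array([1.0, -1.0])   # attacker's target reward direction (toy)
D = np.array([[ 1.0, 0.0],           # feature differences phi(y_w) - phi(y_l)
              [ 0.0, 1.0],
              [ 0.5, 0.2],
              [-0.3, 0.4],
              [ 0.2, -0.6],
              [-0.8, 0.1]])
s = np.ones(len(D))                  # clean labels: y_w preferred in every pair
m = 0.1                              # required margin under theta_star

# a_i = s_i * <theta_star, d_i>; flipping label i negates a_i.
a = s * (D @ theta_star)

# LP relaxation:  min sum_i f_i  s.t.  a_i (1 - 2 f_i) >= m,  0 <= f_i <= 1,
# i.e. (2 a_i) f_i <= a_i - m.
res = linprog(c=np.ones(len(a)),
              A_ub=np.diag(2 * a), b_ub=a - m,
              bounds=[(0.0, 1.0)] * len(a), method="highs")
flips = res.x > 0.5                  # round the relaxation to actual flips
s_poisoned = s * np.where(flips, -1.0, 1.0)

print("flips needed:", int(flips.sum()), "of", len(a))           # 3 of 6
print("margin holds:", bool(np.all(s_poisoned * (D @ theta_star) >= m)))
```

Because the toy constraints are separable per pair, the LP vertex solution is nearly integral and rounding is safe; in the paper's setting the constraints couple pairs through the learned reward parameters, which is what makes tight cost bounds nontrivial.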
Problem

Research questions and friction points this paper is trying to address.

Analyzing minimum-cost label-flipping attacks on LLM alignment
Establishing theoretical bounds for poisoning RLHF/DPO training pipelines
Developing cost-reduction methods for existing poisoning attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formulates poisoning attack as convex optimization problem
Derives theoretical bounds on minimum attack cost
Post-processes existing attacks to reduce label flips
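The last point, post-processing an existing attack to drop unnecessary flips, can be sketched greedily. This is a simplified version with separable per-pair margin constraints and hypothetical names, not the paper's algorithm: each flip is tentatively reverted and restored only if the reversion breaks a constraint.

```python
import numpy as np

def prune_flips(a, flips, m=0.0):
    # Revert each flip in turn; keep the reversion only if the per-pair
    # margin constraint a_i * (1 - 2 * flip_i) >= m still holds.
    kept = flips.copy()
    for i in np.flatnonzero(flips):
        kept[i] = False
        if a[i] * (1 - 2 * kept[i]) < m:   # reverting breaks pair i
            kept[i] = True
    return kept

# Toy scores a_i = s_i * <theta_star, d_i>; a wasteful baseline attack
# flips every pair whose score falls below an overly cautious threshold.
a = np.array([1.0, -1.0, 0.3, -0.7, 0.8, -0.9])
m = 0.1
baseline = a < 0.5                     # 4 flips, one of them unnecessary
pruned = prune_flips(a, baseline, m)
print(int(baseline.sum()), "->", int(pruned.sum()))  # prints "4 -> 3"
```

Here the flip of the pair with score 0.3 is redundant (its margin already exceeds `m`), so pruning removes it while the poisoning effect, every post-flip score at least `m`, is preserved.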