🤖 AI Summary
This work exposes a critical security vulnerability in RLHF/DPO alignment: preference label-flipping attacks can arbitrarily steer a language model's policy with minimal label corruption, without altering the compared outputs. To analyze this threat, we propose the first convex-optimization-based framework for minimum-cost poisoning attacks, rigorously deriving tight upper and lower bounds on attack cost and establishing how that cost depends on the reward model's feature dimensionality and the dataset size. We further design a novel post-processing method for label flipping that significantly reduces the number of required flips compared to prior attacks. Experiments demonstrate that our approach is especially effective in low-dimensional reward feature settings, reducing label-flip cost by several orders of magnitude. Our framework provides a verifiable benchmark for low-cost alignment poisoning and opens new avenues for robustness analysis and defense design against preference-based adversarial alignment.
📝 Abstract
Large language models (LLMs) are increasingly deployed in real-world systems, making it critical to understand their vulnerabilities. While data poisoning attacks during RLHF/DPO alignment have been studied empirically, their theoretical foundations remain unclear. We investigate the minimum-cost poisoning attack required to steer an LLM's policy toward an attacker's target by flipping preference labels during RLHF/DPO, without altering the compared outputs. We formulate this as a convex optimization problem with linear constraints, deriving lower and upper bounds on the minimum attack cost. As a byproduct of this theoretical analysis, we show that any existing label-flipping attack can be post-processed via our proposed method to reduce the number of label flips required while preserving the intended poisoning effect. Empirical results demonstrate that this cost-minimization post-processing can significantly reduce poisoning costs over baselines, particularly when the reward model's feature dimension is small relative to the dataset size. These findings highlight fundamental vulnerabilities in RLHF/DPO pipelines and provide tools to evaluate their robustness against low-cost poisoning attacks.
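To make the convex formulation concrete, here is a minimal toy sketch, not the paper's exact algorithm. It assumes a linear reward model r(x) = θᵀφ(x) and a *fixed* attacker target parameter θ\*, so that flipping label i simply negates the preference margin aᵢ = θ\*ᵀ(φ(y_w⁽ⁱ⁾) − φ(y_l⁽ⁱ⁾)). With fractional flip variables zᵢ ∈ [0, 1], minimizing Σᵢ zᵢ subject to the linear constraints (1 − 2zᵢ)aᵢ ≥ ε is a linear program; because the constraints are separable here, each zᵢ has a closed-form optimum. All names (`min_flip_relaxation`, `eps`, `margins`) are illustrative assumptions, not the paper's notation.

```python
def min_flip_relaxation(margins, eps=0.1):
    """Fractional relaxation of the minimum label-flip attack (toy sketch).

    margins: list of a_i = theta_star^T (phi(y_w_i) - phi(y_l_i)).
    Minimizes sum(z_i) over z_i in [0, 1] subject to the linear
    constraint (1 - 2 z_i) * a_i >= eps for every preference pair.
    """
    z = []
    for a in margins:
        if a >= eps:
            # Pair already agrees with the target reward: no flip mass needed.
            z.append(0.0)
        elif a <= -eps:
            # Minimal z satisfying (1 - 2z) * a >= eps; lies in (0.5, 1].
            z.append((a - eps) / (2 * a))
        else:
            # |a| < eps: flipping only negates a, so neither label
            # orientation can reach the margin -- the LP is infeasible.
            raise ValueError("pair with |margin| < eps cannot be fixed by flipping")
    return z

margins = [0.8, -0.5, 0.3, -1.2, 0.9]
z = min_flip_relaxation(margins)
# Integral attack: any pair with nonzero flip mass must be flipped.
flips = sum(1 for zi in z if zi > 0)
```

On this toy data only the two negative-margin pairs need flipping, so the rounded attack flips 2 of 5 labels. The fractional optimum also certifies a lower bound on any integral attack's cost, which is the role the LP relaxation plays in deriving the paper's bounds.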