🤖 AI Summary
This work addresses the challenge of balancing utility and privacy in preference-based fine-tuning for reinforcement learning from human feedback (RLHF), where the preference data inherently contains sensitive user information. The authors propose a decoupled differentially private framework that, for the first time, enforces privacy guarantees precisely during the reward modeling phase and then optimizes the policy against the resulting private reward model. The theoretical analysis characterizes the suboptimality gap induced by privacy, showing that the privacy cost enters as an additive term on top of the usual non-private statistical error, and identifies the optimal convergence rate for a given sample size and privacy budget. Empirical evaluations on synthetic data and on the Anthropic HH-RLHF dataset show that the proposed method achieves substantially better alignment performance under differential privacy than existing baselines.
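The decoupled pipeline described above — privatize only the reward-learning step, then derive the policy from the private reward model — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: it assumes a linear Bradley-Terry reward model trained with DP-SGD-style per-example gradient clipping plus Gaussian noise, and uses hypothetical synthetic preference features.

```python
import numpy as np

def train_private_reward_model(pref_deltas, clip_norm=1.0, noise_mult=0.5,
                               lr=0.1, steps=200, seed=0):
    """DP-SGD-style training of a linear Bradley-Terry reward model.

    pref_deltas[i] = features(chosen_i) - features(rejected_i).
    Each per-example gradient is clipped to `clip_norm` (bounding its
    sensitivity) and Gaussian noise scaled by `noise_mult * clip_norm`
    is added to the summed gradient — the standard Gaussian-mechanism
    recipe behind differentially private gradient descent.
    """
    rng = np.random.default_rng(seed)
    n, d = pref_deltas.shape
    theta = np.zeros(d)
    for _ in range(steps):
        # Bradley-Terry negative log-likelihood gradient per example:
        # -(1 - sigmoid(theta @ delta)) * delta
        logits = pref_deltas @ theta
        grads = -(1.0 - 1.0 / (1.0 + np.exp(-logits)))[:, None] * pref_deltas
        # Clip each per-example gradient to bound its contribution.
        norms = np.linalg.norm(grads, axis=1, keepdims=True)
        grads = grads / np.maximum(1.0, norms / clip_norm)
        # Sum, add calibrated Gaussian noise, then average and step.
        noisy = grads.sum(axis=0) + rng.normal(0.0, noise_mult * clip_norm,
                                               size=d)
        theta -= lr * noisy / n
    return theta

def greedy_policy(theta, candidates):
    """Non-private policy step: pick the candidate that the private
    reward model scores highest (post-processing preserves DP)."""
    return int(np.argmax(candidates @ theta))

# Hypothetical synthetic preference data, for illustration only.
rng = np.random.default_rng(1)
d = 5
theta_true = rng.normal(size=d)
theta_true /= np.linalg.norm(theta_true)
a = rng.normal(size=(500, d))
b = rng.normal(size=(500, d))
prefer_a = (a - b) @ theta_true > 0
deltas = np.where(prefer_a[:, None], a - b, b - a)

theta_hat = train_private_reward_model(deltas)
```

Because the policy step only post-processes the private reward model, the end-to-end pipeline inherits the reward model's differential privacy guarantee — the structural point the decoupled framework exploits.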
📝 Abstract
Preference-based fine-tuning has become an important component in training large language models, and the data used at this stage may contain sensitive user information. A central question is how to design a differentially private pipeline that is well suited to the distinct structure of reinforcement learning from human feedback. We propose a privacy-preserving framework that imposes differential privacy only on reward learning and derives the final policy from the resulting private reward model. Theoretically, we study the suboptimality gap and show that privacy contributes an additional additive term beyond the usual non-private statistical error. We also establish a minimax lower bound and show that the dominant term changes with sample size and privacy level, which in turn characterizes regimes in which the upper bound is rate-optimal up to logarithmic factors. Empirically, synthetic experiments confirm the scaling predicted by the theory, and experiments on the Anthropic HH-RLHF dataset using the Gemma-2B-IT model show stronger private alignment performance than existing differentially private baseline methods across privacy budgets.
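The additive structure of the suboptimality gap mentioned above is typical of private estimation. As an illustrative schematic only — the symbols $d$ (model dimension), $n$ (number of preference pairs), and $\varepsilon$ (privacy budget) are generic, and this is not the paper's exact bound — such a decomposition takes the form:

```latex
\mathrm{SubOpt}(\hat{\pi})
\;\lesssim\;
\underbrace{\sqrt{\tfrac{d}{n}}}_{\text{non-private statistical error}}
\;+\;
\underbrace{\tfrac{d}{n\varepsilon}}_{\text{additive privacy cost}}
```

In a schematic of this kind, the statistical term dominates when $\varepsilon$ is large relative to $\sqrt{d/n}$, while the privacy term dominates in the high-privacy (small $\varepsilon$) regime — consistent with the abstract's claim that the dominant term changes with sample size and privacy level.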