🤖 AI Summary
This work investigates the theoretical impact of noisy labels in offline alignment under the coupled effects of local differential privacy (LDP) and adversarial label corruption. We consider two sequential scenarios: LDP-then-Corruption (LTC), where preference labels are privatized before being adversarially corrupted, and Corruption-then-LDP (CTL), where corruption precedes privatization. Under a linear reward modeling assumption, we develop a reduction framework that recasts RLHF and DPO optimization as parameter estimation in noisy logistic regression, integrating tools from differential privacy and robust statistics. We establish, for the first time, a strict separation: LTC is provably harder than CTL, yielding a formal gap in their achievable performance. Departing from prior single-factor analyses that focus exclusively on either privacy or corruption, we propose the first unified theoretical framework characterizing offline alignment under *multiple concurrent noise sources*. This advances the foundational understanding of safe, robust preference alignment and extends the theoretical frontier of alignment under security constraints.
📝 Abstract
In this paper, we theoretically investigate the effects of noisy labels in offline alignment, with a focus on the interplay between privacy and robustness against adversarial corruption. Specifically, under linear modeling assumptions, we present a unified analysis covering both reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under two privacy-corruption scenarios: local differential privacy-then-corruption (LTC), where human preference labels are privatized before being corrupted by an adversary, and corruption-then-local differential privacy (CTL), where labels are corrupted before privacy protection is applied. Our analysis rests on a framework that reduces the offline alignment problem under linear modeling assumptions to parameter estimation in logistic regression. This framework allows us to establish a separation result between LTC and CTL, demonstrating that LTC presents a strictly greater challenge than CTL in offline alignment, even under linear models. As important by-products, our findings also advance the state-of-the-art theoretical results for offline alignment under privacy-only and corruption-only scenarios.
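To make the reduction concrete, here is a brief sketch under the standard Bradley–Terry–Luce (BTL) preference model with a linear reward; the specific notation ($\theta^*$, $\phi$, $z$) is illustrative and not taken from the abstract itself:

```latex
% Linear reward: r_{\theta^*}(x,a) = \langle \theta^*, \phi(x,a) \rangle.
% BTL preference probability between actions a^1 and a^0 given context x:
\Pr\bigl(y = 1 \mid x, a^1, a^0\bigr)
  = \sigma\bigl(r_{\theta^*}(x,a^1) - r_{\theta^*}(x,a^0)\bigr)
  = \sigma\bigl(\langle \theta^*, z \rangle\bigr),
\qquad z := \phi(x,a^1) - \phi(x,a^0),
% where \sigma(t) = 1/(1+e^{-t}) is the logistic function.
```

Estimating $\theta^*$ from preference data is therefore logistic regression on the differenced features $z$. Both privatization (e.g., randomized response, which under $\varepsilon$-LDP flips a binary label with probability $1/(1+e^{\varepsilon})$) and adversarial corruption act on the binary label $y$, so both translate directly into label noise in this regression, which is what enables a unified analysis of the LTC and CTL orderings.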