Square$\chi$PO: Differentially Private and Robust $\chi^2$-Preference Optimization in Offline Direct Alignment

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies the theoretical problem of offline alignment of language models with human preferences under the simultaneous constraints of preference-label contamination and differential privacy. Methodologically, it proposes a χ²-based preference-aware squared-loss surrogate for the log-loss, integrates Huber contamination modeling with local and central differential privacy mechanisms, and establishes a unified generalization-error analysis framework for least-squares regression. Key contributions include: (i) the first optimal convergence rate under single-policy concentrability; (ii) the first central-model privacy-preserving alignment algorithm jointly protecting prompts, responses, and preference labels; (iii) the first robust theoretical guarantee for joint contamination-and-privacy mitigation under general function approximation; and (iv) the discovery of a separation phenomenon, namely the non-commutativity of contamination correction and privacy noise injection. All results are derived within the unified framework; the generalization bounds are of independent theoretical interest and achieve state-of-the-art guarantees under both constraints.
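The two data-corruption channels named in the summary, Huber label contamination and local label differential privacy, can be illustrated with a minimal sketch. This is a hedged illustration of the standard mechanisms, not the paper's algorithm; the function names are placeholders:

```python
import math
import random

def randomized_response(label, epsilon, rng=random):
    # Local label DP via randomized response: keep the binary label
    # with probability e^eps / (1 + e^eps), flip it otherwise.
    keep_prob = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return label if rng.random() < keep_prob else 1 - label

def huber_contaminate(label, alpha, adversarial_label, rng=random):
    # Huber contamination: with probability alpha the observed label is
    # drawn from an arbitrary (possibly adversarial) distribution.
    return adversarial_label if rng.random() < alpha else label
```

The "separation" result in the summary concerns exactly these two operations: applying the privacy mechanism before or after contamination correction yields different guarantees, i.e., the two steps do not commute.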

📝 Abstract
In this paper, we theoretically study the offline alignment of language models with human preference feedback, under both preference label corruption and privacy protections. To this end, we propose Square$\chi$PO, a simple one-line change to $\chi$PO where the standard log-loss is replaced by a new square loss over probability. Thanks to the inherent properties of this new loss, we have advanced the state-of-the-art of differentially private and robust offline direct alignment. Specifically, for the local model of label privacy, Square$\chi$PO is the first algorithm that attains an optimal rate based on single-policy concentrability even with general function approximations. It also gives the first result under the central model of privacy protection over both prompts (responses) and labels. On the robustness side against Huber label corruption, Square$\chi$PO is the first alignment method that has a meaningful theoretical guarantee under general function approximations. More importantly, Square$\chi$PO can address privacy protection and corruption simultaneously, where an interesting separation is observed, implying that the order of privacy and corruption matters. Furthermore, we show that Square$\chi$PO can also be easily extended to handle the scenario of the general preference model with state-of-the-art guarantees under corruption and privacy. Last but not least, all of our theoretical guarantees enjoy a unified analysis, building upon a new result on the generalization error bounds of least-square regression under corruption and privacy constraints, which we believe is of independent interest to the community.
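The "one-line change" the abstract describes, replacing the log-loss with a square loss over the predicted preference probability, can be sketched as follows. This is only an illustration of the loss swap on a scalar implicit-reward gap; the paper's actual link function and regularization are not reproduced here:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(logit_gap, label):
    # Standard log-loss (binary cross-entropy) on the preference label,
    # as used in DPO/chi-PO-style objectives.
    p = sigmoid(logit_gap)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def square_loss(logit_gap, label):
    # SquarechiPO-style surrogate: squared error on the probability scale.
    p = sigmoid(logit_gap)
    return (p - label) ** 2
```

Because the square loss is bounded on the probability scale while the log-loss is unbounded near 0 and 1, this swap is the kind of property that plausibly helps when labels are corrupted or perturbed by privacy noise, which is the intuition the abstract appeals to.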
Problem

Research questions and friction points this paper is trying to address.

How to align language models offline with human preference feedback
How to guarantee differential privacy and robustness to preference-label corruption at the same time
Whether a square-loss surrogate can deliver optimal performance guarantees
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replaces the log-loss with a square loss over probability for robust preference optimization
Attains the optimal rate under single-policy concentrability with label privacy
Handles privacy protection and label corruption simultaneously, revealing that their order matters