🤖 AI Summary
This work addresses the lack of convergence guarantees for Direct Preference Optimization (DPO) in distributed settings, where preference data are heterogeneous and scattered across clients. It establishes the first theoretical convergence framework for distributed DPO, encompassing both federated and decentralized paradigms. By integrating tools from federated learning, decentralized optimization, non-IID preference modeling, and spectral graph theory, the study systematically analyzes how client drift, communication frequency, and preference heterogeneity affect optimization dynamics. The analysis yields sufficient conditions for convergence over general communication graphs. Empirical evaluations on standard alignment benchmarks corroborate the theoretical predictions and indicate that the proposed methods remain robust and scalable in practice.
📝 Abstract
Preference-based reinforcement learning (RL) is a key paradigm for aligning policies with human judgments, yet its theoretical behavior in distributed settings where preference data are fragmented across heterogeneous users remains poorly understood. Direct Preference Optimization (DPO) avoids explicit reward modeling but lacks convergence guarantees under federated and decentralized training, where communication constraints and non-IID preferences fundamentally alter optimization dynamics. We provide the first convergence and time-complexity analysis of DPO in distributed environments. Modeling personalized offline RL with user-specific preference distributions, we characterize the induced global optimization landscape. For federated DPO, we derive convergence rates that quantify the impact of client drift, communication frequency, and preference heterogeneity; for decentralized DPO, we establish convergence over general communication graphs and show how spectral connectivity governs optimization speed and consensus. Empirically, we corroborate our theoretical insights on standard alignment benchmarks, demonstrating that our proposed methods not only enjoy strong theoretical guarantees but also deliver robust and scalable performance in practice. The code base is available here.
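To make the quantities named in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation or their code base. It shows the standard DPO loss, a federated server averaging step, and a decentralized gossip step whose mixing matrix has a spectral gap. The function names, the ring topology, the uniform 1/3 mixing weights, and the default `beta=0.1` are all assumptions chosen for illustration.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss on log-probabilities of preferred (w) and rejected (l) responses."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) == log(1 + exp(-m)), evaluated stably with logaddexp
    return float(np.mean(np.logaddexp(0.0, -margin)))

def fedavg(client_params, weights=None):
    """Federated server step: (weighted) average of the clients' parameter vectors."""
    return np.average(np.stack(client_params), axis=0, weights=weights)

def ring_mixing_matrix(n):
    """Symmetric, doubly stochastic mixing matrix for a ring of n >= 3 nodes (assumed topology)."""
    if n < 3:
        raise ValueError("ring construction here assumes n >= 3")
    W = np.zeros((n, n))
    for i in range(n):
        # each node averages itself with its two ring neighbors
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    return W

def spectral_gap(W):
    """One minus the second-largest eigenvalue magnitude of a symmetric mixing matrix."""
    magnitudes = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
    return float(1.0 - magnitudes[1])

def gossip_step(W, X):
    """Decentralized consensus step: each row of X (one node's parameters) mixes with neighbors."""
    return W @ X
```

In this sketch a larger spectral gap means faster consensus: a 4-node ring mixes faster than a 16-node ring, which is the kind of dependence on connectivity the abstract attributes to decentralized DPO. A local DPO gradient step per node would sit between gossip steps; it is omitted here because the abstract does not specify the update rule.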