🤖 AI Summary
This work addresses the low sample efficiency and slow convergence of offline preference learning in language model alignment by investigating the advantages of on-policy methods. Through a theoretical analysis of how coverage evolves under on-policy sampling, the authors propose the "Coverage Improvement Principle," proving that with sufficiently large batches each policy update strictly improves the quality of state-action coverage, making subsequent data increasingly informative and yielding exponential convergence. This establishes a sharp separation in sample complexity between on-policy and offline approaches. Building on this insight, they design a two-stage hybrid sampler based on a preferential G-optimal experimental design and introduce a deviation-based notion of coverage for reward distillation. Both theoretical analysis and empirical results demonstrate that the proposed algorithms significantly outperform their offline counterparts and achieve monotonic performance improvements across iterations.
📝 Abstract
Online on-policy preference learning algorithms for language model alignment, such as online direct preference optimization (DPO), can significantly outperform their offline counterparts. We provide a theoretical explanation for this phenomenon by analyzing how the sampling policy's coverage evolves throughout on-policy training. We propose and rigorously justify the \emph{coverage improvement principle}: with sufficient batch size, each update moves into a region around the target where coverage is uniformly better, making subsequent data increasingly informative and enabling rapid convergence. In the contextual bandit setting with Bradley-Terry preferences and a linear softmax policy class, we show that on-policy DPO converges exponentially in the number of iterations for batch sizes exceeding a generalized coverage threshold. In contrast, any learner restricted to offline samples from the initial policy suffers a slower minimax rate, leading to a sharp separation in total sample complexity. Motivated by this analysis, we further propose a simple hybrid sampler based on a novel \emph{preferential} G-optimal design, which removes dependence on coverage and guarantees convergence in just two rounds. Finally, we develop principled on-policy schemes for reward distillation in the general function class setting, and show faster noiseless rates under an alternative deviation-based notion of coverage. Experimentally, we confirm that on-policy DPO and our proposed reward distillation algorithms outperform their off-policy counterparts and enjoy stable, monotonic performance gains across iterations.
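To make the analyzed setting concrete, here is a minimal sketch (not from the paper; all names, features, and values are illustrative) of the Bradley-Terry preference model and the standard DPO loss under a linear softmax policy $\pi_\theta(y\mid x) \propto \exp(\theta^\top \phi(x,y))$ for a single prompt:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_policy(theta, feats):
    """Log-probabilities of a linear softmax policy pi_theta(y|x).
    `feats` is an (n_actions, d) matrix of features phi(x, y) for one prompt x."""
    logits = feats @ theta
    return logits - np.log(np.exp(logits).sum())

def bt_preference_prob(reward_w, reward_l):
    """Bradley-Terry model: P(y_w preferred over y_l) = sigmoid(r_w - r_l)."""
    return sigmoid(reward_w - reward_l)

def dpo_loss(theta, theta_ref, feats, w, l, beta=0.1):
    """Standard DPO loss for one preference pair: y_w preferred over y_l.
    The implicit reward is beta * (log pi_theta - log pi_ref)."""
    lp, lp_ref = log_policy(theta, feats), log_policy(theta_ref, feats)
    margin = beta * ((lp[w] - lp_ref[w]) - (lp[l] - lp_ref[l]))
    return -np.log(sigmoid(margin))

# Sanity check: at theta == theta_ref the implicit reward margin is zero,
# so the per-pair DPO loss equals log 2.
theta_ref = rng.normal(size=3)
feats = rng.normal(size=(4, 3))      # 4 candidate responses, 3 features each
loss_at_ref = dpo_loss(theta_ref, theta_ref, feats, w=0, l=1)
```

On-policy training in this setting repeatedly samples response pairs from the current $\pi_\theta$, collects Bradley-Terry preference labels, and minimizes this loss; the offline baseline instead draws all pairs from the fixed initial policy.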