🤖 AI Summary
This work studies, from a theoretical perspective, how key design choices affect the quality of the learned policy in offline preference-based reinforcement learning from human feedback (RLHF). Focusing on mainstream methods such as DPO, IPO, and SLiC, along with their many variants, the analysis gives a unified treatment of the choice of loss function, the policy used to normalize log-likelihoods, and the role of the data sampling policy. Notably, the results do not rely on the standard reparameterization-style arguments commonly used to motivate algorithms in this family, which allows a broad class of methods to be analyzed within one framework and clarifies the distinctions among them. A small empirical study on a standard summarization benchmark supports the theoretical findings, indicating that the normalization and sampling choices materially influence final alignment quality.
📝 Abstract
Offline algorithms for reinforcement learning from human feedback (RLHF), which rely only on a fixed dataset of responses sampled for each input, together with preference feedback among those responses, have gained increasing prominence in the literature on aligning language models. In this paper, we study from a theoretical perspective how the different design choices made in methods such as DPO, IPO, SLiC, and their many variants influence the quality of the learned policy. Our treatment yields insights into the choice of loss function, the policy used to normalize log-likelihoods, and the role of the data sampling policy. Notably, our results do not rely on the standard reparameterization-style arguments used to motivate some of the algorithms in this family, which allows us to give a unified treatment to a broad class of methods. We also conduct a small empirical study to verify some of the theoretical findings on a standard summarization benchmark.
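To make the design choices concrete, the sketch below writes out the per-pair losses commonly associated with the three methods named in the abstract: DPO's logistic loss and IPO's squared loss both operate on a reference-normalized log-likelihood margin, while SLiC's hinge loss (in its basic calibration form) uses the raw margin. This is an illustrative sketch, not the paper's formulation; the hyperparameter names (`beta`, `tau`, `delta`) and default values are assumptions made here for illustration.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: logistic loss on the reference-normalized log-likelihood margin.

    logp_* are log-probabilities of the preferred (w) and dispreferred (l)
    responses under the trained policy; ref_logp_* under the reference policy.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """IPO: squared loss pulling the normalized margin toward 1 / (2 * tau)."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return (margin - 1.0 / (2.0 * tau)) ** 2

def slic_loss(logp_w, logp_l, delta=1.0):
    """SLiC (calibration term): hinge loss on the raw log-likelihood margin,
    with no reference-policy normalization."""
    return max(0.0, delta - (logp_w - logp_l))
```

Comparing the three side by side highlights two of the axes the paper studies: whether log-likelihoods are normalized by a reference policy (DPO and IPO) or not (SLiC), and how the loss shape (logistic, squared, hinge) weights preference pairs of different margins.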