🤖 AI Summary
Large language models (LLMs) often generate harmful outputs misaligned with human preferences, and existing preference alignment methods lack rigorous generalization guarantees. Method: This work investigates the generalization capability of direct preference optimization (DPO) under diverse human values. We establish the first theoretical framework for DPO generalization error under finite-step gradient updates, deriving tight generalization bounds via reward-margin trajectory analysis. Our approach pairs this theoretical analysis, which combines generalization bounds with reward-margin modeling, with empirical validation across multiple mainstream LLMs. Contribution/Results: We prove that DPO-trained models correctly rank preferred responses over dispreferred ones on unseen samples with high probability. Moreover, generalization performance improves robustly with increasing value diversity and sample size. This work bridges a critical gap in DPO theory, providing the first verifiable, generalization-theoretic foundation for value-aligned LLM training.
📝 Abstract
Large language models (LLMs) have demonstrated remarkable capabilities but often struggle to align with human preferences, leading to harmful or undesirable outputs. Preference learning, which trains models to distinguish between preferred and non-preferred responses based on human feedback, has become a crucial component for ensuring that LLMs align with human values. Despite its widespread adoption in real-world systems, a thorough theoretical understanding of the generalization guarantees for these models remains lacking. This paper bridges that gap by introducing a new theoretical framework to analyze the generalization guarantees of models trained with direct preference optimization (DPO). While existing generalization theory often focuses on overparameterized models achieving near-optimal loss or on models analyzed independently of the training process, our framework rigorously assesses how well models generalize after a finite number of gradient steps, reflecting real-world LLM training practices. By analyzing the reward margin associated with each sample and its trajectory throughout training, we can effectively bound the generalization error. We derive learning guarantees showing that, under specific conditions, models trained with DPO can correctly discern preferred responses on unseen data with high probability. These insights are empirically validated on contemporary LLMs, underscoring the practical relevance of our theoretical findings.
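To make the central quantity concrete, here is a minimal sketch of the per-sample reward margin that standard DPO optimizes (the difference between the policy's implicit rewards for the preferred and dispreferred responses) and the corresponding log-sigmoid loss. The function names and the choice of `beta` are illustrative, not from the paper; a positive margin on a held-out pair is exactly the "correctly discern preferred responses" event the abstract refers to.

```python
import math

def dpo_reward_margin(logp_chosen, logp_rejected,
                      ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Reward margin for one preference pair.

    Each implicit reward is beta * (policy log-prob - reference log-prob);
    the margin is the chosen reward minus the rejected reward.
    """
    r_chosen = beta * (logp_chosen - ref_logp_chosen)
    r_rejected = beta * (logp_rejected - ref_logp_rejected)
    return r_chosen - r_rejected

def dpo_loss(margin):
    """Standard DPO objective for one pair: -log sigmoid(margin)."""
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative log-probabilities (hypothetical values):
# the policy prefers the chosen response relative to the reference model.
margin = dpo_reward_margin(logp_chosen=-10.0, logp_rejected=-12.0,
                           ref_logp_chosen=-11.0, ref_logp_rejected=-11.0)
# margin > 0 means the model ranks the preferred response above the
# dispreferred one; training pushes margins up, shrinking the loss.
```

Tracking how these margins evolve over finite-step training trajectories is, per the abstract, what lets the paper bound the generalization error.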