🤖 AI Summary
This work addresses key limitations in existing language model alignment methods, which often rely on specific human preference models such as Bradley–Terry, lack statistical consistency, and suffer from instability due to unbounded direct density ratio optimization. To overcome these issues, the authors propose a novel alignment approach based on a bounded relative density ratio: the ratio between the preferred data distribution and a mixture of the preferred and non-preferred data distributions. By dispensing with both unbounded density ratios and explicit preference models, the method achieves training stability and statistical consistency. Evaluated on Qwen 2.5 and Llama 3, the proposed technique substantially outperforms DDRO in alignment performance, improves training stability, and provides tighter theoretical convergence guarantees.
📄 Abstract
Aligning language models with human preferences is essential for ensuring their safety and reliability. Although most existing approaches assume specific human preference models such as the Bradley-Terry model, this assumption may fail to accurately capture true human preferences, and consequently, these methods lack statistical consistency, i.e., the guarantee that language models converge to the true human preference as the number of samples increases. In contrast, direct density ratio optimization (DDRO) achieves statistical consistency without assuming any human preference models. DDRO models the density ratio between preferred and non-preferred data distributions using the language model, and then optimizes it via density ratio estimation. However, this density ratio is unstable and often diverges, leading to training instability of DDRO. In this paper, we propose a novel alignment method that is both stable and statistically consistent. Our approach is based on the relative density ratio between the preferred data distribution and a mixture of the preferred and non-preferred data distributions. Our approach is stable since this relative density ratio is bounded above and does not diverge. Moreover, it is statistically consistent and yields significantly tighter convergence guarantees than DDRO. We experimentally show its effectiveness with Qwen 2.5 and Llama 3.
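The abstract does not give the exact formula, but the described quantity matches the standard relative density ratio from the density-ratio estimation literature: for preferred distribution p, non-preferred distribution q, and an assumed mixture weight α ∈ (0, 1], the ratio r_α(x) = p(x) / (α·p(x) + (1−α)·q(x)) is bounded above by 1/α, whereas the plain ratio p(x)/q(x) used by DDRO diverges as q(x) → 0. A minimal numeric sketch under these assumptions (α and the function names are illustrative, not from the paper):

```python
def plain_density_ratio(p: float, q: float) -> float:
    # Plain ratio p/q (as in DDRO); unbounded as q -> 0.
    return p / q

def relative_density_ratio(p: float, q: float, alpha: float = 0.5) -> float:
    # Relative ratio p / (alpha*p + (1-alpha)*q); bounded above by 1/alpha,
    # so it cannot diverge even when q vanishes.
    return p / (alpha * p + (1.0 - alpha) * q)

# As q shrinks, the plain ratio blows up while the relative ratio
# saturates at 1/alpha = 2.
for q in (1.0, 0.1, 1e-6):
    print(f"q={q:g}  plain={plain_density_ratio(1.0, q):g}  "
          f"relative={relative_density_ratio(1.0, q):g}")
```

The upper bound 1/α is what makes the objective stable: no matter how little mass the non-preferred distribution places on a sample, the target ratio stays finite.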