🤖 AI Summary
This work addresses the systematic bias inherent in current AI feedback mechanisms, such as LLM-as-Judge, which limits their ability to replace high-quality human preference labels in alignment training. To mitigate this issue, the authors propose a general statistical framework that integrates external human feedback under heterogeneous data distributions through residual correction and density-ratio weighting. Building on this framework, they develop two debiased algorithms: DDPO, which preserves the computational efficiency of DPO, and DIPO, which avoids parametric reward modeling altogether and attains the semiparametric efficiency bound. Experiments on sentiment generation, summarization, and single-turn dialogue show that the proposed approach substantially improves alignment performance, closely approaching an oracle model trained on fully human-annotated preferences.
📝 Abstract
Modern alignment pipelines are increasingly replacing expensive human preference labels with evaluations from large language models (LLM-as-Judge). However, AI labels can be systematically biased relative to high-quality human labels. In this paper, we develop two debiased alignment methods within a general framework that accommodates heterogeneous prompt-response distributions and external human feedback sources. Debiased Direct Preference Optimization (DDPO) augments standard DPO with a residual-based correction and density-ratio reweighting to mitigate systematic bias, while retaining DPO's computational efficiency. Debiased Identity Preference Optimization (DIPO) directly estimates human preference probabilities without imposing a parametric reward model. We provide theoretical guarantees for both methods: DDPO offers a practical and computationally efficient solution for large-scale alignment, whereas DIPO serves as a robust, statistically optimal alternative that attains the semiparametric efficiency bound. Empirical studies on sentiment generation, summarization, and single-turn dialogue demonstrate that the proposed methods substantially improve alignment efficiency and recover performance close to that of an oracle trained on fully human-labeled data.
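The paper's exact objective is not given here, but the DDPO recipe described above (a density-ratio-reweighted loss on the large AI-labeled set plus a residual correction estimated on a small doubly-labeled human set) can be illustrated with a rough sketch. Everything below is an assumption for illustration: the function names, the dictionary fields (`w` for the density-ratio weight, `m_ai` and `m_human` for the DPO implicit-reward margins under the AI and human labels), and the specific debiased form are not the authors' implementation.

```python
import math

def sigmoid_nll(margin):
    # -log sigma(margin): DPO's per-pair logistic loss on the implicit
    # reward margin beta * (log-ratio_chosen - log-ratio_rejected).
    return math.log1p(math.exp(-margin))

def debiased_dpo_loss(ai_pairs, human_pairs):
    """Hypothetical DDPO-style objective (form assumed, not the paper's):
    the large AI-labeled set enters with density-ratio weights w(x), and a
    residual term on the doubly-labeled human set corrects the systematic
    gap between AI and human preference labels.
    """
    # Main term: reweighted DPO loss on the large AI-labeled set.
    main = sum(p["w"] * sigmoid_nll(p["m_ai"]) for p in ai_pairs) / len(ai_pairs)
    # Residual correction: human-label loss minus AI-label loss on the
    # small set where both label sources are available.
    resid = sum(
        p["w"] * (sigmoid_nll(p["m_human"]) - sigmoid_nll(p["m_ai"]))
        for p in human_pairs
    ) / len(human_pairs)
    return main + resid

# Toy check: with unit weights and AI labels that agree with the human
# labels, the residual term vanishes and only the main term remains.
ai = [{"w": 1.0, "m_ai": 0.5}, {"w": 1.0, "m_ai": -0.2}]
hu = [{"w": 1.0, "m_ai": 0.3, "m_human": 0.3}]
print(round(debiased_dpo_loss(ai, hu), 4))  # → 0.6361
```

The residual term is what makes the estimator debiased: if the AI judge were perfectly aligned with human raters, it would average to zero, and the objective would reduce to the reweighted AI-only loss.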