Direct Advantage Regression: Aligning LLMs with Online AI Reward

📅 2025-04-19
🤖 AI Summary
Online AI feedback (OAIF) relies on coarse-grained reward signals—e.g., binary preferences—that limit the alignment precision of large language models (LLMs). Method: This paper proposes Direct Advantage Regression (DAR), a reinforcement-learning-free approach that uses continuous scalar rewards generated online by an AI annotator to estimate the advantage function directly. DAR then applies advantage-based weighting to supervised fine-tuning samples, enabling fine-grained alignment without policy optimization or environment interaction. Contribution/Results: Theoretically, DAR remains consistent with online RLHF while eliminating RL's computational and implementation complexities. Empirical evaluation with GPT-4-Turbo and MT-bench shows that scalar AI reward yields higher human-AI agreement than binary AI preference, and that DAR outperforms state-of-the-art OAIF and online RLHF baselines. By enabling efficient, high-resolution, low-complexity LLM alignment, DAR offers a scalable paradigm for precise preference learning.
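The core idea—turning scalar AI rewards into per-sample weights for supervised fine-tuning—can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: it assumes the advantage is estimated as the reward minus a batch-mean baseline, and that weights are exponentiated, temperature-scaled advantages (an AWR-style choice); the function names and the `beta` temperature are hypothetical.

```python
import math

def advantage_weights(rewards, beta=1.0):
    """Turn scalar AI rewards for a batch of sampled responses into
    per-sample weights for supervised fine-tuning.

    Assumptions (not specified in the summary above): the advantage is
    estimated as reward minus the batch mean, and weights are the
    exponentiated, temperature-scaled advantages, normalized to sum to 1.
    """
    baseline = sum(rewards) / len(rewards)           # value baseline ~ batch mean
    advantages = [r - baseline for r in rewards]     # A(x, y) = r(x, y) - V(x)
    raw = [math.exp(a / beta) for a in advantages]   # higher advantage -> larger weight
    z = sum(raw)
    return [w / z for w in raw]

def weighted_sft_loss(nll_losses, rewards, beta=1.0):
    """Advantage-weighted SFT objective: a weighted average of per-sample
    negative log-likelihoods. No rollouts or policy gradients are needed,
    which is what makes the approach RL-free."""
    weights = advantage_weights(rewards, beta)
    return sum(w * l for w, l in zip(weights, nll_losses))
```

For example, with rewards `[0.2, 0.9, 0.5]` for three sampled responses, the second response receives the largest weight, so fine-tuning pushes the policy toward it—the weighting plays the role that the policy gradient's advantage term plays in online RLHF.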

📝 Abstract
Online AI Feedback (OAIF) presents a promising alternative to Reinforcement Learning from Human Feedback (RLHF) by using online AI preference to align large language models (LLMs). However, the straightforward replacement of humans with AI deprives LLMs of finer-grained AI supervision beyond binary signals. In this paper, we propose Direct Advantage Regression (DAR), a simple alignment algorithm that uses online AI reward to optimize policy improvement through weighted supervised fine-tuning. As an RL-free approach, DAR maintains theoretical consistency with online RLHF pipelines while significantly reducing implementation complexity and improving learning efficiency. Our empirical results show that AI reward is a better form of AI supervision than AI preference, consistently achieving higher human-AI agreement. Additionally, evaluations using GPT-4-Turbo and MT-bench show that DAR outperforms both OAIF and online RLHF baselines.
Problem

Research questions and friction points this paper is trying to address.

Aligning LLMs with online AI reward signals
Overcoming limitations of binary AI preference feedback
Simplifying RL-free alignment while improving efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct Advantage Regression for alignment
Online AI reward replaces human feedback
RL-free approach improves efficiency