Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment

📅 2025-08-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Supervised fine-tuning (SFT) is constrained by fixed offline data distributions, while reinforcement learning (RL) suffers from low sample efficiency and strong dependence on high-quality base models. To unify the strengths of both paradigms, this paper proposes GRAO—a novel alignment framework. Its core innovations are: (1) multi-sample generation coupled with a group-wise direct alignment loss, enabling fine-grained preference modeling via intra-group relative ranking; and (2) a reference-aware parameter update mechanism that jointly incorporates reward signals and supervised gradients, enhancing optimization stability and model evolvability. Evaluated across diverse alignment benchmarks, GRAO outperforms SFT, DPO, PPO, and GRPO by 57.70%, 17.65%, 7.95%, and 5.18%, respectively—demonstrating substantial gains in both alignment fidelity and sample efficiency.

📝 Abstract
Alignment methodologies have emerged as a critical pathway for enhancing language model alignment capabilities. While SFT (supervised fine-tuning) accelerates convergence through direct token-level loss intervention, its efficacy is constrained by the offline policy trajectory distribution. In contrast, RL (reinforcement learning) facilitates exploratory policy optimization, but suffers from low sample efficiency and stringent dependency on high-quality base models. To address these dual challenges, we propose GRAO (Group Relative Alignment Optimization), a unified framework that synergizes the respective strengths of SFT and RL through three key innovations: 1) A multi-sample generation strategy enabling comparative quality assessment via reward feedback; 2) A novel Group Direct Alignment Loss formulation leveraging intra-group relative advantage weighting; 3) Reference-aware parameter updates guided by pairwise preference dynamics. Our theoretical analysis establishes GRAO's convergence guarantees and sample efficiency advantages over conventional approaches. Comprehensive evaluations across complex human alignment tasks demonstrate GRAO's superior performance, achieving 57.70%, 17.65%, 7.95%, and 5.18% relative improvements over SFT, DPO, PPO, and GRPO baselines respectively. This work provides both a theoretically grounded alignment framework and empirical evidence for efficient capability evolution in language models.
Problem

Research questions and friction points this paper is trying to address.

Enhancing language model alignment via unified SFT and RL integration
Improving sample efficiency in alignment with multi-sample generation
Optimizing policy updates using group relative advantage weighting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-sample generation for comparative reward feedback
Group Direct Alignment Loss with intra-group relative advantage weighting
Reference-aware parameter updates guided by pairwise preference dynamics
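The three components above can be illustrated with a minimal sketch. The function names, the z-score normalization of group rewards, and the β-scaled policy/reference log-ratio are assumptions inferred from the abstract's description of "intra-group relative advantage weighting" and "reference-aware parameter updates", not the paper's exact formulation:

```python
import math

def group_relative_advantages(rewards):
    """Normalize rewards within one sampled group to zero mean and unit
    std (GRPO-style). Assumed form of GRAO's intra-group weighting."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 if all rewards are equal
    return [(r - mean) / std for r in rewards]

def group_alignment_loss(logps, ref_logps, rewards, beta=0.1):
    """Hypothetical Group Direct Alignment Loss: advantage-weighted
    log-ratio between the policy and a frozen reference model,
    averaged over the group of sampled responses. Minimizing this
    pushes probability mass toward above-average responses while the
    reference term anchors the update."""
    adv = group_relative_advantages(rewards)
    # Maximize the advantage-weighted log-ratio -> minimize its negative.
    weighted = sum(a * beta * (lp - rlp)
                   for a, lp, rlp in zip(adv, logps, ref_logps))
    return -weighted / len(rewards)
```

Under this sketch, a group where the higher-reward response also has a higher policy log-probability (relative to the reference) yields a negative loss, i.e. the update direction already agrees with the reward signal.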
👥 Authors
Haowen Wang
Intelligence Healthcare Department, Ant Group, Hangzhou, China
Yun Yue
Ant Group
Zhiling Ye
Intelligence Healthcare Department, Ant Group, Hangzhou, China
Shuowen Zhang
The Hong Kong Polytechnic University
Lei Fan
Intelligence Healthcare Department, Ant Group, Hangzhou, China
Jiaxin Liang
The Chinese University of Hong Kong
Jiadi Jiang
Intelligence Healthcare Department, Ant Group, Hangzhou, China
Cheng Wei
Intelligence Healthcare Department, Ant Group, Hangzhou, China
Jingyuan Deng
Intelligence Healthcare Department, Ant Group, Hangzhou, China
Xudong Han
Intelligence Healthcare Department, Ant Group, Hangzhou, China
Ji Li
Principal Group Science Manager at Microsoft
Chunxiao Guo
Intelligence Healthcare Department, Ant Group, Hangzhou, China
Peng Wei
Intelligence Healthcare Department, Ant Group, Hangzhou, China
Jian Wang
Intelligence Healthcare Department, Ant Group, Hangzhou, China
Jinjie Gu
Ant Group