🤖 AI Summary
To address the policy-expert distribution mismatch, inefficient knowledge utilization, and training instability arising from the strict separation of supervised fine-tuning (SFT) and reinforcement learning (RL) in LLM alignment, this paper proposes a unified, single-stage preference learning framework. Our method jointly models demonstration-based and comparison-based preference data, enabling end-to-end co-optimization of the policy and expert distributions over mixed batches via adversarial preference learning and constrained optimization. The key innovation lies in eliminating the modality boundary between SFT and RL, thereby achieving dynamic distribution alignment and enhanced behavioral consistency. Experiments on the Qwen3 series demonstrate that our approach significantly outperforms the strong GRPO baseline: the 0.6B variant matches the performance of a 32B model while generating outputs substantially closer to expert demonstrations.
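To make the single-stage, mixed-batch idea concrete, below is a minimal, self-contained PyTorch sketch of one update that sums a demonstration (SFT-style) negative log-likelihood with a comparative preference loss over the same batch. The toy model, the DPO-style preference term, the frozen reference policy, and the loss weights are illustrative assumptions for exposition only, not UniAPL's actual adversarial objective or implementation.

```python
# Hypothetical sketch: one "mixed batch" update that combines a demonstration
# (SFT-style) loss with a comparative (preference) loss in a single step.
# Model sizes, the DPO-style term, and all weights are illustrative assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, dim, seq = 100, 32, 16

policy = torch.nn.Sequential(torch.nn.Embedding(vocab, dim),
                             torch.nn.Linear(dim, vocab))
ref = torch.nn.Sequential(torch.nn.Embedding(vocab, dim),
                          torch.nn.Linear(dim, vocab))   # frozen reference policy
for p in ref.parameters():
    p.requires_grad_(False)
opt = torch.optim.AdamW(policy.parameters(), lr=1e-4)


def seq_logprob(model, tokens):
    """Sum of per-token log-probabilities of `tokens` under `model`."""
    logits = model(tokens[:, :-1])                        # predict next tokens
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1).sum(-1)


# One mixed batch: expert demonstrations plus chosen/rejected comparison pairs.
demo     = torch.randint(0, vocab, (4, seq))   # expert demonstrations (SFT-style)
chosen   = torch.randint(0, vocab, (4, seq))   # preferred responses
rejected = torch.randint(0, vocab, (4, seq))   # dispreferred responses

# Demonstration loss: negative log-likelihood on expert data.
sft_loss = -seq_logprob(policy, demo).mean()

# Comparative loss: a DPO-style margin against the frozen reference,
# standing in for the paper's adversarial/constrained preference term.
beta = 0.1
margin = beta * ((seq_logprob(policy, chosen) - seq_logprob(ref, chosen))
                 - (seq_logprob(policy, rejected) - seq_logprob(ref, rejected)))
pref_loss = -F.logsigmoid(margin).mean()

# Single-stage objective: both signals update the policy in every step.
loss = sft_loss + 1.0 * pref_loss
opt.zero_grad()
loss.backward()
opt.step()
print(f"sft={sft_loss.item():.3f}  pref={pref_loss.item():.3f}")
```

The point of the sketch is only that both gradient signals reach the policy in every step, so the expert demonstrations continuously ground and regularize the preference-driven updates instead of being consumed in a separate, earlier stage.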
📝 Abstract
Shaping powerful LLMs to be beneficial and safe is central to AI alignment. We argue that post-training alignment is fundamentally a unified Preference Learning problem involving two modalities: demonstrated preferences (e.g., Supervised Fine-Tuning, SFT) and comparative preferences (e.g., Reinforcement Learning, RL). The standard sequential pipeline (SFT followed by RL) is flawed due to a critical distributional mismatch: SFT uses static expert data, but as the policy evolves, its generation distribution drifts, making SFT knowledge brittle. Subsequent RL then explores without direct access to the rich, ground-truth knowledge in expert demonstrations, leading to inefficient, ungrounded updates. This separation prevents mutual regularization between data sources. To address this, we reframe alignment as a constrained optimization problem and propose Unified Adversarial Preference Learning (UniAPL), a novel framework that dynamically aligns the policy's distribution with the expert's. UniAPL implements a single-stage unified training objective, jointly learning from mixed batches of SFT and preference data. In every gradient step, dense expert demonstrations directly ground and regularize online exploration, inherently resolving distributional mismatch and maximizing data synergy. We evaluate UniAPL on instruction-following tasks using Qwen3-235B-Instruct-2507 as the teacher. Our models match or exceed strong GRPO baselines: +5.77% on Qwen3-0.6B (matching a 32B model) and +3.75% on Qwen3-4B, even outperforming the teacher. Analyses of response length and log-probability distributions confirm that UniAPL outputs closely mimic expert demonstrations, achieving both stronger performance and better behavioral alignment.
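As a rough sketch of the constrained-optimization framing mentioned above (the notation, the reward $r$, and the KL-based constraint are our assumptions; the paper's exact formulation may differ), the alignment problem can be written as

$$
\max_{\pi_\theta}\;\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[\, r(x, y) \,\right]
\qquad \text{s.t.} \qquad
\mathbb{E}_{x \sim \mathcal{D}}\!\left[\, D_{\mathrm{KL}}\!\left(\pi_E(\cdot \mid x) \,\Vert\, \pi_\theta(\cdot \mid x)\right) \right] \le \epsilon,
$$

where $\pi_\theta$ is the policy being trained, $\pi_E$ is the expert distribution implied by the demonstrations, $r$ is a comparative preference reward, and $\epsilon$ bounds how far the policy may drift from the expert. Solving this jointly in a single stage, rather than first fitting $\pi_E$ with SFT and then optimizing $r$ with RL, is what removes the separate SFT-then-RL phases.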