🤖 AI Summary
This work addresses two critical training instabilities that arise when applying Reinforcement Learning from Human Feedback (RLHF) alignment to distillation-trained reasoning models: Sequence Length Collapse, the premature truncation of generated outputs, and the Reward Hockey Stick Curve, an abrupt reward drop followed by slow recovery. To mitigate these issues, the authors propose Balanced Actor Initialization (BAI), a two-stage weighted model merging technique that initializes the RLHF actor from instruction-tuned, distillation-based reasoning fine-tuned, and pretrained model weights. This multi-source initialization establishes a more robust starting point for RLHF optimization. Experiments demonstrate that BAI enables stable end-to-end RLHF training of distilled reasoning models: it eliminates Sequence Length Collapse, substantially smooths the reward trajectory, and preserves strong performance across diverse benchmarks, including mathematical reasoning, code generation, and commonsense QA, without compromising either reasoning capability or foundational language modeling competence. Overall, BAI markedly improves RLHF training stability.
📝 Abstract
The development of alignment and reasoning capabilities in large language models has seen remarkable progress through two paradigms: instruction tuning followed by reinforcement learning from human feedback (RLHF) alignment, and distillation-based reasoning fine-tuning. While both approaches are effective independently, a third paradigm, applying RLHF to distillation-trained models, presents significant challenges. Our investigation reveals two critical phenomena that emerge in this paradigm: Sequence Length Collapse, where generation length drops dramatically during early RLHF training, and the Reward Hockey Stick Curve, featuring a severe reward drop followed by gradual recovery. These instabilities fundamentally compromise the model's alignment and reasoning capabilities. To address these challenges, we propose Balanced Actor Initialization (BAI), a two-stage weighted model merging approach. BAI first merges the instruction-following and distillation-based reasoning fine-tuned models, then combines this intermediate model with the pretrained model to preserve foundational knowledge. Through comprehensive experiments across diverse benchmarks and detailed analysis of the training dynamics, we demonstrate that BAI resolves Sequence Length Collapse, mitigates the Reward Hockey Stick Curve, and enables continuous sequence-length improvement during training. Additionally, our analysis reveals that balanced merging ratios achieve the best trade-off between training stability and reasoning capability preservation. Our work provides an effective solution for stable training in this third paradigm, enabling more capable reasoning models that combine distillation efficiency with RLHF alignment.
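The two-stage weighted merge described in the abstract can be sketched as simple parameter interpolation. This is a minimal illustration, not the paper's implementation: the function names and the merge ratios `alpha` and `beta` are assumptions, and real models would use per-tensor `state_dict` entries rather than the plain floats used here to keep the sketch self-contained.

```python
def weighted_merge(params_a, params_b, weight_a):
    """Elementwise interpolation: weight_a * A + (1 - weight_a) * B.

    In practice params_a/params_b would be model state_dicts of
    tensors with identical keys and shapes; plain floats keep this
    sketch dependency-free.
    """
    return {k: weight_a * params_a[k] + (1 - weight_a) * params_b[k]
            for k in params_a}

def balanced_actor_init(instruct, distill, pretrained,
                        alpha=0.5, beta=0.8):
    """Hypothetical two-stage merge in the spirit of BAI.

    alpha and beta are illustrative ratios, not values from the paper.
    """
    # Stage 1: merge the instruction-following and the
    # distillation-based reasoning fine-tuned models.
    intermediate = weighted_merge(instruct, distill, alpha)
    # Stage 2: fold the pretrained model into the intermediate
    # model to preserve foundational knowledge.
    return weighted_merge(intermediate, pretrained, beta)
```

With equal ratios (`alpha = beta = 0.5`) a single parameter at 1.0 (instruction-tuned), 3.0 (distilled), and 0.0 (pretrained) merges to 2.0 after stage one and 1.0 after stage two, showing how the pretrained weights pull the actor initialization back toward its foundational state.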