Aligning Large Language Models via Self-Steering Optimization

πŸ“… 2024-10-22
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 1
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing automated alignment methods prioritize data generation while neglecting rigorous quality control, resulting in noisy and unreliable preference signals that hinder iterative optimization. To address this, we propose Self-Steering Optimization (SSO), the first fully bootstrapped, end-to-end preference learning framework requiring neither human annotations nor external models. SSO introduces a principle-driven, dynamic steering mechanism that maintains a consistent quality gap between chosen and rejected responses online, while ensuring both remain within the current policy's capability envelope. It integrates on-policy preference generation, response-quality contrastive modeling, and a hybrid online-offline training paradigm. Evaluated on Qwen2 and Llama3.1, SSO achieves significant improvements across six major objective and subjective benchmarks. Notably, the generated preference data substantially boosts RewardBench scores, demonstrating both high accuracy and strong generalization capability.
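The steering mechanism described above can be illustrated with a minimal sketch. This is not the paper's implementation: `score` is a hypothetical stand-in for the principle-based quality judge, and the gap band `[min_gap, max_gap]` is an assumed way of encoding "a consistent quality gap" over responses sampled from the current policy.

```python
# Illustrative sketch only (not SSO's actual code): pick an on-policy
# (chosen, rejected) pair whose quality gap falls in a target band.
# `score` is a hypothetical principle-based judge returning a scalar.

def select_preference_pair(responses, score, min_gap=0.2, max_gap=0.8):
    """Return a (chosen, rejected) pair with a bounded quality gap.

    responses: candidate completions sampled from the current policy
    score:     callable mapping a response to a scalar quality estimate
    Returns None when no pair satisfies the gap constraint.
    """
    # Rank candidates from best to worst by judged quality.
    scored = sorted(((score(r), r) for r in responses), reverse=True)
    for hi_score, hi in scored:
        # Scan from the worst candidate upward to find a suitable partner.
        for lo_score, lo in reversed(scored):
            gap = hi_score - lo_score
            if min_gap <= gap <= max_gap:
                return hi, lo  # (chosen, rejected)
    return None  # no pair with a gap the current policy can learn from
```

Capping the gap from above, not just below, reflects the summary's point that both responses must stay within the policy's capability envelope: a pair that is too easy to separate carries little learnable signal.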

πŸ“ Abstract
Automated alignment develops alignment systems with minimal human intervention. The key to automated alignment lies in providing learnable and accurate preference signals for preference learning without human annotation. In this paper, we introduce Self-Steering Optimization (SSO), an algorithm that autonomously generates high-quality preference signals based on predefined principles during iterative training, eliminating the need for manual annotation. SSO maintains the accuracy of signals by ensuring a consistent gap between chosen and rejected responses while keeping both on-policy, suited to the current policy model's learning capacity. SSO can benefit both online and offline training of the policy model, as well as enhance the training of reward models. We validate the effectiveness of SSO with two foundation models, Qwen2 and Llama3.1, showing that it provides accurate, on-policy preference signals throughout iterative training. Without any manual annotation or external models, SSO leads to significant performance improvements across six subjective and objective benchmarks. Moreover, the preference data generated by SSO significantly enhances the performance of the reward model on RewardBench. Our work presents a scalable approach to preference optimization, paving the way for more efficient and effective automated alignment.
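For background on the preference-learning step the abstract refers to, a standard DPO-style pairwise objective is sketched below. This is common background, not SSO's specific loss; the log-probability arguments and `beta` are the usual DPO quantities, assumed here for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO-style preference loss (background; not SSO's exact objective).

    logp_*: policy log-probabilities of the chosen / rejected responses
    ref_*:  reference-model log-probabilities of the same responses
    beta:   temperature controlling deviation from the reference model
    """
    # Implicit reward margin between chosen and rejected responses.
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # Negative log-sigmoid of the margin: small when chosen clearly wins.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

A zero margin (no preference expressed) yields a loss of log 2; the loss shrinks as the policy ranks the chosen response increasingly above the rejected one relative to the reference model.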
Problem

Research questions and friction points this paper is trying to address.

Autonomous generation of high-quality preference data for LLMs
Lack of quality control in automated alignment systems
Improving human preference alignment and reward optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autonomous high-quality preference data generation
Specialized optimization objective for data generator
Scalable framework for preference optimization
πŸ”Ž Similar Papers
2024-06-05arXiv.orgCitations: 1