Dynamic Noise Preference Optimization for LLM Self-Improvement via Synthetic Data

📅 2025-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the reliance of large language model (LLM) iterative optimization on costly human annotation and its tendency to plateau, this paper proposes Dynamic Noise Preference Optimization (DNPO), a fully automated, annotation-free framework for stable multi-round self-improvement. DNPO combines a dynamic sample labeling mechanism, which constructs synthetic preference pairs from the model's own generations, with controlled, trainable noise injected into a Direct Preference Optimization (DPO)-style training process. Evaluated on Zephyr-7B, DNPO achieves an average +2.6% improvement across multiple benchmarks, and GPT-4 pairwise evaluation shows a 29.4% win-loss rate gap in favor of DNPO-generated data over the baseline, alleviating the performance stagnation common in synthetic-data-driven optimization.

📝 Abstract
Although LLMs have achieved significant success, their reliance on large volumes of human-annotated data has limited their potential for further scaling. In this situation, utilizing self-generated synthetic data has become crucial for fine-tuning LLMs without extensive human annotation. However, current methods often fail to ensure consistent improvements across iterations, with performance stagnating after only minimal updates. To overcome these challenges, we introduce Dynamic Noise Preference Optimization (DNPO). DNPO employs a dynamic sample labeling mechanism to construct preference pairs for training and introduces controlled, trainable noise into the preference optimization process. Our approach effectively prevents stagnation and enables continuous improvement. In experiments with Zephyr-7B, DNPO consistently outperforms existing methods, showing an average performance boost of 2.6% across multiple benchmarks. Additionally, DNPO shows a significant improvement in model-generated data quality, with a 29.4% win-loss rate gap compared to the baseline in GPT-4 evaluations. This highlights its effectiveness in enhancing model performance through iterative refinement.
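The dynamic sample labeling mechanism mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the `score` function and the best-vs-worst pairing rule are hypothetical stand-ins, since the abstract does not specify the labeling criterion.

```python
def build_preference_pairs(generations, score):
    """Toy dynamic labeling: for each prompt, rank the model's own
    generations with a scoring function and pair the best against the
    worst as (chosen, rejected). Both the scoring function and the
    pairing rule are illustrative assumptions; DNPO's actual labeling
    mechanism may differ."""
    pairs = []
    for prompt, gens in generations.items():
        ranked = sorted(gens, key=score, reverse=True)
        # Skip prompts where no generation is strictly preferred.
        if score(ranked[0]) > score(ranked[-1]):
            pairs.append({"prompt": prompt,
                          "chosen": ranked[0],
                          "rejected": ranked[-1]})
    return pairs
```

For example, with `score=len` as a placeholder quality proxy, a prompt whose generations all tie produces no pair, which is one simple way to avoid training on uninformative comparisons.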
Problem

Research questions and friction points this paper is trying to address.

Optimize LLM self-improvement via synthetic data
Prevent performance stagnation in iterative training
Enhance quality of model-generated data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic sample labeling mechanism
Controlled trainable noise introduction
Preference optimization for self-improvement
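The contributions above can be sketched as a DPO-style loss with a noise term on the reward margin. This is a minimal sketch under an assumed parameterization: the additive placement of `noise` is an illustrative choice, and the paper's trainable noise may enter the objective differently.

```python
import math

def noisy_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                   noise=0.0, beta=0.1):
    """DPO-style loss for one preference pair (w = chosen, l = rejected),
    with an additive noise term on the reward margin. The additive
    placement of `noise` is an assumption for illustration; DNPO's
    trainable noise mechanism may differ."""
    # Implicit reward margin: beta * (policy/reference log-ratio gap).
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)) + noise
    # Standard DPO objective: -log sigmoid(margin).
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In this toy form, a positive noise value widens the effective margin and lowers the loss for that pair; making the noise trainable (as the paper proposes) would let the optimizer modulate how strongly each pair's preference signal is enforced.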