Dynamic Noise Preference Optimization for LLM Self-Improvement via Synthetic Data

📅 2025-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the reliance of large language model (LLM) iterative optimization on costly human annotation and its tendency to plateau, this paper proposes Dynamic Noise Preference Optimization (DNPO), a fully automated, annotation-free framework for stable multi-round self-improvement. DNPO combines a dynamic sample labeling mechanism, which constructs synthetic preference pairs from the model's own generations, with controlled, trainable noise injected into a Direct Preference Optimization (DPO)-style training process. Evaluated on Zephyr-7B, DNPO achieves an average +2.6% improvement across multiple benchmarks, and GPT-4 pairwise evaluation shows a 29.4% win-loss rate gap in favor of DNPO-generated data over the baseline, alleviating the performance stagnation common in synthetic-data-driven optimization.

📝 Abstract
Although LLMs have achieved significant success, their reliance on large volumes of human-annotated data has limited their potential for further scaling. In this situation, utilizing self-generated synthetic data has become crucial for fine-tuning LLMs without extensive human annotation. However, current methods often fail to ensure consistent improvements across iterations, with performance stagnating after only minimal updates. To overcome these challenges, we introduce Dynamic Noise Preference Optimization (DNPO). DNPO employs a dynamic sample labeling mechanism to construct preference pairs for training and introduces controlled, trainable noise into the preference optimization process. Our approach effectively prevents stagnation and enables continuous improvement. In experiments with Zephyr-7B, DNPO consistently outperforms existing methods, showing an average performance boost of 2.6% across multiple benchmarks. Additionally, DNPO shows a significant improvement in model-generated data quality, with a 29.4% win-loss rate gap compared to the baseline in GPT-4 evaluations. This highlights its effectiveness in enhancing model performance through iterative refinement.
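The dynamic sample labeling mechanism mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the `score` function and the best-vs-worst pairing rule are hypothetical stand-ins, since the abstract does not specify the labeling criterion.

```python
def build_preference_pairs(generations, score):
    """Toy dynamic labeling: for each prompt, rank the model's own
    generations with a scoring function and pair the best against the
    worst as (chosen, rejected). Both the scoring function and the
    pairing rule are illustrative assumptions; DNPO's actual labeling
    mechanism may differ."""
    pairs = []
    for prompt, gens in generations.items():
        ranked = sorted(gens, key=score, reverse=True)
        # Skip prompts where no generation is strictly preferred.
        if score(ranked[0]) > score(ranked[-1]):
            pairs.append({"prompt": prompt,
                          "chosen": ranked[0],
                          "rejected": ranked[-1]})
    return pairs
```

For example, with `score=len` as a placeholder quality proxy, a prompt whose generations all tie produces no pair, which is one simple way to avoid training on uninformative comparisons.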
Problem

Research questions and friction points this paper is trying to address.

Optimize LLM self-improvement via synthetic data
Prevent performance stagnation in iterative training
Enhance quality of model-generated data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic sample labeling mechanism
Controlled trainable noise introduction
Preference optimization for self-improvement
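The contributions above can be sketched as a DPO-style loss with a noise term on the reward margin. This is a minimal sketch under an assumed parameterization: the additive placement of `noise` is an illustrative choice, and the paper's trainable noise may enter the objective differently.

```python
import math

def noisy_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                   noise=0.0, beta=0.1):
    """DPO-style loss for one preference pair (w = chosen, l = rejected),
    with an additive noise term on the reward margin. The additive
    placement of `noise` is an assumption for illustration; DNPO's
    trainable noise mechanism may differ."""
    # Implicit reward margin: beta * (policy/reference log-ratio gap).
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)) + noise
    # Standard DPO objective: -log sigmoid(margin).
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In this toy form, a positive noise value widens the effective margin and lowers the loss for that pair; making the noise trainable (as the paper proposes) would let the optimizer modulate how strongly each pair's preference signal is enforced.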