Refine-n-Judge: Curating High-Quality Preference Chains for LLM Fine-Tuning

📅 2025-08-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity, high cost, and poor scalability of high-quality human feedback data in preference tuning, this paper proposes an LLM-driven iterative refinement framework. It employs a single large language model to serve jointly as response refiner and preference judge, eliminating the need for human annotations or an external reward model, and generates high-quality preference sequences via multi-round self-improvement with an automatic termination criterion. The core contribution is a demonstration that a unified model can fill both the generative and discriminative roles in preference-data construction, improving the scalability and consistency of the data pipeline. Empirically, models fine-tuned on data generated by this method were preferred by GPT-4 in over 74% of pairwise comparisons against models tuned on the original datasets, and showed gains of +5% on AlpacaEval and +19% on MT-Bench.

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable progress through preference-based fine-tuning, which critically depends on the quality of the underlying training data. While human feedback is essential for improving data quality, it is costly and does not scale well. In this paper, we introduce Refine-n-Judge, an automated iterative approach that leverages a single LLM as both a refiner and a judge to enhance dataset quality. Unlike existing iterative refinement methods, Refine-n-Judge employs an LLM to both generate refinements and explicitly evaluate each improvement, ensuring that every iteration meaningfully enhances the dataset without requiring additional human annotation or a separate reward model. At each step, the LLM refines a response and judges whether the refinement improves on the previous answer. This process continues until the LLM prefers the previous answer over the refinement, indicating no further improvement. The result is a sequence of increasingly high-quality, preference-labeled responses well suited for fine-tuning. We demonstrate the effectiveness of Refine-n-Judge across a range of public datasets spanning five corpora, targeting tasks such as coding, math, and conversation. Models (Llama 3.1-8B and Llama 3.3-70B) fine-tuned on Refine-n-Judge-enhanced datasets were preferred by GPT-4 in over 74% of comparisons against models tuned on the original datasets. Additionally, we report performance gains: +5% on AlpacaEval and AlpacaEval 2.0, and +19% on MT-Bench. Our results indicate that Refine-n-Judge produces high-quality datasets and scalable model improvements.
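The refine-then-judge loop described in the abstract can be sketched as a short function. This is a minimal illustration, not the paper's implementation: `refine` and `judge` are hypothetical callables standing in for prompted LLM calls, and `max_rounds` is an assumed safety cap not specified in the source.

```python
from typing import Callable, List, Tuple

def refine_n_judge(
    prompt: str,
    initial_response: str,
    refine: Callable[[str, str], str],       # (prompt, response) -> refined response
    judge: Callable[[str, str, str], bool],  # (prompt, old, new) -> True if new is preferred
    max_rounds: int = 5,
) -> List[Tuple[str, str]]:
    """Iteratively refine a response, keeping each refinement only if the
    judge prefers it; stop when the previous answer is preferred.

    Returns adjacent (rejected, chosen) pairs from the resulting chain,
    i.e. preference-labeled data suitable for preference tuning.
    """
    chain = [initial_response]
    for _ in range(max_rounds):
        candidate = refine(prompt, chain[-1])
        # Termination criterion: the judge no longer prefers the refinement.
        if not judge(prompt, chain[-1], candidate):
            break
        chain.append(candidate)
    # Each consecutive pair in the chain is a preference-labeled example.
    return [(chain[i], chain[i + 1]) for i in range(len(chain) - 1)]
```

In practice both roles would be served by the same LLM with different prompts; the key design point from the paper is that the judge's explicit accept/reject decision both filters out non-improvements and provides the stopping condition.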
Problem

Research questions and friction points this paper is trying to address.

Automating high-quality dataset refinement for LLM fine-tuning
Reducing reliance on costly human feedback in dataset improvement
Enhancing model performance through iterative LLM-based refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated LLM refiner and judge
Iterative refinement without human annotation
Generates quality preference-labeled sequences
Derin Cayir
Meta Reality Labs
Renjie Tao
Meta Reality Labs
Rashi Rungta
Meta Reality Labs
Kai Sun
Meta Reality Labs
Sean Chen
Meta Reality Labs
Haidar Khan
Meta
Minseok Kim
Meta Reality Labs
Julia Reinspach
Stanford University
Yue Liu
Meta Reality Labs

Topics: Natural Language Processing, Machine Learning