Dr. SoW: Density Ratio of Strong-over-weak LLMs for Reducing the Cost of Human Annotation in Preference Tuning

📅 2024-11-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high cost and strong dependency on human preference annotations in domain-specific reward modeling, this paper proposes Dr. SoW: a training-free, unsupervised method that directly uses the log-density ratio between a strongly aligned and a weakly aligned large language model as a reward signal, enabling fully automated synthesis of high-quality preference labels. Key contributions include: (1) an empirical characterization linking reward quality to the performance gap between strong and weak models; (2) a distillation-inspired contrastive framework coupled with a domain-aware reward customization pipeline, enabling zero-shot domain adaptation; and (3) state-of-the-art results on RewardBench (82.6 overall, 91.0 on Safety and 88.0 on Reasoning sub-benchmarks), driving Llama-3-8B-Instruct to +15.1% and +17.8% win-rate gains on ArenaHard and AlpacaEval 2.0, respectively.

📝 Abstract
Preference tuning relies on high-quality human preference data, which is often expensive and time-consuming to gather. In this paper, we introduce Dr. SoW (Density Ratio of Strong over Weak), a cost-effective method that eliminates the reliance on human annotation by leveraging off-the-shelf LLMs for preference data annotation. Dr. SoW uses the log-density ratio between a better-aligned and a less-aligned LLM as a reward signal. We evaluate Dr. SoW across 221 different LLM pairs and empirically find a strong correlation between the performance gap of the paired models and the quality of the reward signal. This insight provides a practical guideline for selecting LLMs for data annotation. Additionally, we introduce an end-to-end pipeline that customizes reward functions based on user query domains. Without fine-tuning, it improves accuracy on domain-specific evaluations. With a pair of Mistral-7B models, Dr. SoW achieves a RewardBench score of 82.6, outperforming the best trained reward functions from the same model class and demonstrating competitive performance against SoTA models in the Safety (91.0) and Reasoning (88.0) domains. Further, we preference-tune Llama-3-8B-Instruct using data annotated by Dr. SoW. Our approach pushes Llama-3-8B to a 37.4% (+15.1%) win rate on ArenaHard and a 40.7% (+17.8%) win rate on length-controlled AlpacaEval 2.0.
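The core reward described in the abstract can be sketched in a few lines. Below is a minimal, self-contained illustration (the token log-probabilities are hard-coded stand-ins; in practice they would come from forward passes of the strong and weak LLMs, and the function and variable names here are hypothetical, not from the paper's code):

```python
# Toy illustration of a density-ratio reward:
#   reward(y | x) = log p_strong(y | x) - log p_weak(y | x)
# A response that the better-aligned model finds relatively more likely
# than the weaker model does receives a higher reward.

# Hypothetical per-token log-probabilities for two candidate responses
# to the same prompt, one dict per annotator model.
strong_logprobs = {
    "response_a": [-0.5, -0.8, -0.3],
    "response_b": [-1.2, -1.5, -0.9],
}
weak_logprobs = {
    "response_a": [-1.0, -1.1, -0.7],
    "response_b": [-1.1, -1.3, -0.8],
}

def density_ratio_reward(response: str) -> float:
    """Sequence-level log-density ratio, summed over tokens."""
    return sum(strong_logprobs[response]) - sum(weak_logprobs[response])

def annotate_preference(resp1: str, resp2: str) -> tuple[str, str]:
    """Return (chosen, rejected) by comparing density-ratio rewards."""
    r1, r2 = density_ratio_reward(resp1), density_ratio_reward(resp2)
    return (resp1, resp2) if r1 >= r2 else (resp2, resp1)

chosen, rejected = annotate_preference("response_a", "response_b")
print(chosen, rejected)  # response_a is preferred: its ratio is higher
```

Pairs labeled this way can then be fed directly into a standard preference-tuning recipe such as DPO, with no human annotation in the loop.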
Problem

Research questions and friction points this paper is trying to address.

Model Preference Setting
Human Dependency Reduction
Domain-specific Performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Density Ratio Comparison
Automatic Reward Signal Generation
Domain-specific Accuracy Improvement