DDO-RM for LLM Preference Optimization: A Minimal Held-Out Benchmark against DPO

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes Decision Distribution Optimization with Reward Modeling (DDO-RM) and tests whether a reward-guided policy update can outperform a direct pairwise objective such as Direct Preference Optimization (DPO) in a minimal pairwise preference setting. DDO-RM treats each prompt as a decision problem over a finite set of candidate responses: it forms a policy distribution over the candidates, centers reward-model scores under that distribution, and distills the resulting reward-guided target distribution back into the policy. On EleutherAI/pythia-410m with the binarized UltraFeedback dataset, with results reported for three seeds, DDO-RM improves on DPO on the held-out test split: mean pair accuracy rises from 0.5238 to 0.5602, AUC from 0.5315 to 0.5382, and mean margin from 0.1377 to 0.5353. The results are preliminary, covering one model family, one dataset, one held-out evaluation split, and three seeds.
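The three reported metrics can be illustrated with a minimal sketch. It assumes each held-out pair yields one scalar score for the chosen response and one for the rejected response (for example, a policy log-probability); the summary does not state how scores are computed, so the function name `pairwise_metrics`, the tie handling, and the pooled AUC definition below are illustrative assumptions rather than the benchmark's actual evaluation code.

```python
import numpy as np

def pairwise_metrics(chosen_scores, rejected_scores):
    """Return (pair_accuracy, auc, mean_margin) for paired preference scores.

    Assumption: scores are arbitrary real numbers where higher means
    "more preferred by the policy"; this is not taken from the paper.
    """
    chosen = np.asarray(chosen_scores, dtype=float)
    rejected = np.asarray(rejected_scores, dtype=float)

    # Margin of each pair: how much higher the chosen response scores.
    margins = chosen - rejected

    # Pair accuracy: fraction of pairs whose chosen response scores higher.
    pair_accuracy = float(np.mean(margins > 0))

    # Mean margin: average score gap over all pairs.
    mean_margin = float(np.mean(margins))

    # AUC (assumed pooled definition): probability that a random chosen
    # score exceeds a random rejected score, counting ties as one half.
    greater = chosen[:, None] > rejected[None, :]
    ties = chosen[:, None] == rejected[None, :]
    auc = float(np.mean(greater + 0.5 * ties))

    return pair_accuracy, auc, mean_margin

if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    acc, auc, margin = pairwise_metrics([1.0, 0.2, 0.5], [0.5, 0.4, -0.5])
    print(acc, auc, margin)
```

Under this reading, pair accuracy and mean margin are within-pair statistics, while AUC compares chosen and rejected scores across all pairs, which is one reason the three numbers can move by different amounts.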

📝 Abstract

This paper reorganizes the current manuscript around the DPO versus DDO-RM preference-optimization project and focuses on two parts: the algorithmic view and the preliminary held-out benchmark. The benchmark asks a narrow question: even in a minimal pairwise chosen-versus-rejected setting, can a reward-guided decision-distribution update outperform a direct pairwise objective? We compare Direct Preference Optimization (DPO) against DDO-RM on EleutherAI/pythia-410m using HuggingFaceH4/ultrafeedback_binarized, evaluate on the held-out test_prefs split, and report results for seeds 42, 13, and 3407. Algorithmically, DDO-RM treats each prompt as a finite decision problem over candidate responses. Instead of optimizing only a binary chosen-rejected relation, it forms a policy distribution over candidates, centers reward-model scores under that distribution, and distills a reward-guided target distribution back into the policy. In the current public benchmark, DDO-RM improves mean pair accuracy from 0.5238 to 0.5602, AUC from 0.5315 to 0.5382, and mean margin from 0.1377 to 0.5353 relative to DPO. These are encouraging but still preliminary results: the study covers one model family, one dataset, one held-out evaluation split, and three seeds.
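The three algorithmic steps named in the abstract (policy distribution over candidates, reward centering under that distribution, distillation of a reward-guided target) can be sketched as follows. The abstract does not give the exact target formula, so the exponential tilting of the policy by centered rewards, the `temperature` parameter, the cross-entropy distillation loss, and all function names here are assumptions made for illustration, not the paper's definition.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over a 1-D array."""
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x)
    return shifted - np.log(np.sum(np.exp(shifted)))

def ddo_rm_target(policy_logps, rewards, temperature=1.0):
    """Build a reward-guided target over one prompt's candidate responses.

    policy_logps: policy log-probabilities (or logits) of each candidate.
    rewards:      reward-model scores of the same candidates.
    temperature:  hypothetical knob controlling how far the target moves.
    """
    # Step 1: policy distribution restricted to the finite candidate set.
    log_pi = log_softmax(policy_logps)
    pi = np.exp(log_pi)

    # Step 2: center rewards under the policy distribution.
    r = np.asarray(rewards, dtype=float)
    centered = r - np.sum(pi * r)

    # Step 3 (assumed form): tilt the policy toward above-average rewards.
    target = np.exp(log_softmax(log_pi + centered / temperature))
    return pi, centered, target

def distillation_loss(policy_logps, target):
    """Cross-entropy from the fixed target to the policy (assumed loss)."""
    return float(-np.sum(target * log_softmax(policy_logps)))

if __name__ == "__main__":
    # Minimal pairwise case: two candidates, chosen and rejected.
    pi, centered, target = ddo_rm_target([0.0, 0.0], [1.0, -1.0])
    print(pi, centered, target, distillation_loss([0.0, 0.0], target))
```

In the two-candidate case the candidate set is just the chosen and rejected responses, so this kind of update stays comparable to DPO while still using the size of the reward gap rather than only its sign.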

Problem

Research questions and friction points this paper is trying to address.

preference optimization

reward modeling

language models

DPO

benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

DDO-RM

preference optimization

reward-guided policy