🤖 AI Summary
This work proposes Decision Distribution Optimization with Reward Modeling (DDO-RM) and asks whether a reward-guided policy update can outperform a direct pairwise objective such as Direct Preference Optimization (DPO), even in a minimal pairwise preference setting. Treating each prompt as a decision problem over a finite set of candidate responses, DDO-RM centers reward-model scores under the policy's distribution over those candidates, builds a reward-guided target distribution, and distills it back into the policy. On Pythia-410m with the binarized UltraFeedback dataset, DDO-RM improves on DPO on the held-out test set: mean pair accuracy rises from 0.5238 to 0.5602, AUC from 0.5315 to 0.5382, and mean margin from 0.1377 to 0.5353. The results are preliminary, covering one model family, one dataset, one held-out evaluation split, and three seeds.
📝 Abstract
This paper is organized around the DPO versus DDO-RM preference-optimization project and has two parts: the algorithmic view and a preliminary held-out benchmark. The benchmark asks a narrow question: even in a minimal pairwise chosen-versus-rejected setting, can a reward-guided decision-distribution update outperform a direct pairwise objective? We compare Direct Preference Optimization (DPO) against DDO-RM on EleutherAI/pythia-410m using HuggingFaceH4/ultrafeedback_binarized, evaluate on the held-out test_prefs split, and report results for seeds 42, 13, and 3407.
Algorithmically, DDO-RM treats each prompt as a finite decision problem over candidate responses. Instead of optimizing only a binary chosen-rejected relation, it forms a policy distribution over candidates, centers reward-model scores under that distribution, and distills a reward-guided target distribution back into the policy. In the current public benchmark, DDO-RM improves mean pair accuracy from 0.5238 to 0.5602, AUC from 0.5315 to 0.5382, and mean margin from 0.1377 to 0.5353 relative to DPO. These are encouraging but still preliminary results: the study covers one model family, one dataset, one held-out evaluation split, and three seeds.
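The three steps named above (policy distribution over candidates, reward centering, distillation of a target) can be illustrated with a minimal sketch. The abstract does not give the exact form of the target distribution, so the exponential tilting of the policy by centered rewards, the temperature `tau`, the cross-entropy distillation loss, and the function names `ddo_rm_target` and `distill_loss` are all assumptions made for illustration, not the paper's implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def ddo_rm_target(policy_logps, rewards, tau=1.0):
    """Hypothetical DDO-RM-style target for one prompt.

    policy_logps: sequence log-probabilities of each candidate under the policy.
    rewards: reward-model scores for the same candidates.
    tau: assumed temperature controlling how strongly rewards tilt the policy.
    """
    # Step 1: policy distribution restricted to the finite candidate set.
    p = softmax(policy_logps)
    # Step 2: center the rewards under that distribution (baseline = E_p[r]).
    baseline = sum(pi * ri for pi, ri in zip(p, rewards))
    centered = [ri - baseline for ri in rewards]
    # Step 3 (assumed form): target q_i proportional to p_i * exp(centered_i / tau).
    q = softmax([lp + c / tau for lp, c in zip(policy_logps, centered)])
    return p, centered, q


def distill_loss(p, q):
    """Cross-entropy of the policy p against a fixed target q (assumed loss)."""
    return -sum(qi * math.log(pi) for pi, qi in zip(p, q))


# Pairwise case: candidate 0 is "chosen", candidate 1 is "rejected".
p, centered, q = ddo_rm_target([-12.0, -11.5], [1.0, -1.0])
loss = distill_loss(p, q)
```

In this pairwise example the policy initially prefers the rejected response, and the assumed target shifts probability mass toward the response with the higher reward. With more than two candidates the same functions apply unchanged, which is the sense in which the update acts on a distribution rather than on a single chosen-rejected relation.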