🤖 AI Summary
Traditional binary preference supervision struggles to capture fine-grained quality in reasoning processes, limiting the alignment efficacy of large language models on complex reasoning tasks. This work proposes CU-DPO, a novel framework that replaces binary labels with continuous utility scores to enable fine-grained alignment across diverse prompt-driven cognitive strategies. The approach employs a two-stage decoupled training procedure: first selecting strategies via a best-vs-all mechanism, then refining strategy execution through margin-stratified contrastive learning combined with entropy regularization. Theoretical analysis establishes a Θ(K log K) sample-complexity improvement over binary preference supervision when learning with K strategies. Empirical results demonstrate that, across seven base models, strategy selection accuracy improves from 35–46% to 68–78%, with in-distribution mathematical reasoning performance gaining up to 6.6 points and strong generalization observed on out-of-distribution tasks.
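The convergence claim behind the entropy-regularized training can be made concrete with the standard KL-regularized objective from the DPO literature, here written with a continuous utility u(x, y) in place of a binary reward (a sketch of the well-known result, not the paper's exact derivation):

```latex
% Entropy-regularized utility maximization against a reference policy:
\max_{\pi} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}
  \bigl[ u(x, y) \bigr]
\;-\; \beta \,
\mathbb{D}_{\mathrm{KL}}\!\bigl[ \pi(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \bigr]

% Its closed-form optimum, the utility-maximizing policy the text refers to:
\pi^{*}(y \mid x)
= \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x)
  \exp\!\bigl( u(x, y) / \beta \bigr),
\qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)
  \exp\!\bigl( u(x, y) / \beta \bigr).
```

Higher β keeps the learned policy closer to the reference model; lower β lets the continuous utility dominate.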
📝 Abstract
Large language model reasoning is often treated as a monolithic capability, relying on binary preference supervision that fails to capture partial progress or fine-grained reasoning quality. We introduce Continuous Utility Direct Preference Optimization (CU-DPO), a framework that aligns models to a portfolio of prompt-based cognitive strategies by replacing binary labels with continuous scores that capture fine-grained reasoning quality. We prove that learning with K strategies yields a Θ(K log K) improvement in sample complexity over binary preferences, and that DPO converges to the entropy-regularized utility-maximizing policy. To exploit this signal, we propose a two-stage training pipeline: (i) strategy selection, which optimizes the model to choose the best strategy for a given problem via best-vs-all comparisons, and (ii) execution refinement, which trains the model to correctly execute the selected strategy using margin-stratified pairs. On mathematical reasoning benchmarks, CU-DPO improves strategy selection accuracy from 35–46% to 68–78% across seven base models, yielding consistent downstream reasoning gains of up to 6.6 points on in-distribution datasets with effective transfer to out-of-distribution tasks.
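The second stage's margin-stratified pairing and a continuous-utility variant of the DPO loss can be sketched as follows. This is a minimal, hypothetical instantiation assembled from the abstract's description: the helper names (`build_margin_stratified_pairs`, `cu_dpo_loss`), the binning scheme, and the way the utility margin enters the loss are all assumptions, not the paper's published formulation.

```python
import math

def build_margin_stratified_pairs(responses, num_bins=3):
    """Turn (text, utility) samples into preference pairs stratified by
    utility margin. Scores are assumed to lie in [0, 1]; the binning
    scheme is illustrative, not taken from the paper."""
    pairs = []
    for i, (a, sa) in enumerate(responses):
        for b, sb in responses[i + 1:]:
            margin = abs(sa - sb)
            if margin == 0:
                continue  # ties carry no preference signal
            # Assign each pair to a margin bin so training can balance
            # easy (large-margin) and hard (small-margin) comparisons.
            bin_idx = min(int(margin * num_bins), num_bins - 1)
            chosen, rejected = (a, b) if sa > sb else (b, a)
            pairs.append({"chosen": chosen, "rejected": rejected,
                          "margin": margin, "bin": bin_idx})
    return pairs

def cu_dpo_loss(logp_chosen, logp_rejected,
                ref_logp_chosen, ref_logp_rejected,
                margin, beta=0.1):
    """DPO-style logistic loss where the continuous utility margin sets
    the target separation between implicit rewards (one plausible way
    to use continuous scores; the paper's exact loss may differ)."""
    logits = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # A larger utility margin demands a larger log-ratio gap before
    # the loss saturates.
    return math.log(1.0 + math.exp(-(logits - margin)))
```

Stratifying by margin lets the curriculum weight small-margin (fine-grained) pairs differently from large-margin ones, which is where the continuous signal adds information that binary labels discard.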