🤖 AI Summary
Problem: Existing preference optimization methods rely solely on explicit outcome-based metrics, such as answer correctness, while neglecting the internal logical coherence of model responses.
Method: We propose a dual-metric preference optimization framework that, for the first time, introduces token-level generation probability consistency as a quantifiable, intrinsic coherence criterion, jointly modeled with answer correctness. Our approach integrates explicit (outcome-driven) and implicit (probability-distribution alignment–driven) preference signals, and seamlessly adapts to both RLHF and DPO pipelines.
Contribution/Results: Evaluated on Llama, Qwen, and Phi models across mathematical reasoning benchmarks (including GSM8K, MATH, and AMC), our method consistently outperforms state-of-the-art preference optimization approaches. Empirical results demonstrate that explicitly modeling logical coherence significantly enhances LLMs' mathematical reasoning capabilities, establishing coherence-aware preference optimization as a critical step toward aligning language models with human-like reasoning.
📝 Abstract
Recent advances in preference optimization have demonstrated significant potential for improving mathematical reasoning capabilities in large language models (LLMs). While current approaches leverage high-quality pairwise preference data through outcome-based criteria like answer correctness or consistency, they fundamentally neglect the internal logical coherence of responses. To overcome this, we propose Probability-Consistent Preference Optimization (PCPO), a novel framework that establishes dual quantitative metrics for preference selection: (1) surface-level answer correctness and (2) intrinsic token-level probability consistency across responses. Extensive experiments show that PCPO consistently outperforms approaches that rely on outcome-only criteria across a diverse range of LLMs and benchmarks. Our code is publicly available at https://github.com/YunqiaoYang/PCPO.
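To make the dual-metric selection concrete, here is a minimal illustrative sketch of building a preference pair from both criteria: answer correctness and a token-level probability-consistency score. All function names, the candidate dictionary format, and the specific consistency measure (closeness of length-normalized log-probabilities) are hypothetical stand-ins for exposition, not the actual PCPO implementation.

```python
import math

def avg_logprob(token_probs):
    """Length-normalized log-probability of a response's tokens."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

def consistency(probs_a, probs_b):
    """Hypothetical probability-consistency score: closeness of two
    responses' average log-probabilities (0 = identical confidence
    profile; more negative = less consistent)."""
    return -abs(avg_logprob(probs_a) - avg_logprob(probs_b))

def select_pair(candidates):
    """Form a (chosen, rejected) preference pair using both criteria:
    surface-level correctness AND intrinsic probability consistency."""
    correct = [c for c in candidates if c["correct"]]
    wrong = [c for c in candidates if not c["correct"]]
    if not correct or not wrong:
        return None  # need both outcomes to form a pair
    # Chosen: correct response with the highest model confidence.
    chosen = max(correct, key=lambda c: avg_logprob(c["token_probs"]))
    # Rejected: incorrect response least consistent with the chosen one.
    rejected = min(
        wrong,
        key=lambda c: consistency(c["token_probs"], chosen["token_probs"]),
    )
    return chosen, rejected
```

The resulting pairs could then feed a standard DPO or RLHF pipeline unchanged, since only the pair-construction step differs from outcome-only selection.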