🤖 AI Summary
Problem: Existing preference optimization methods rely solely on explicit outcome-based metrics, such as answer correctness, while neglecting the internal logical coherence of model responses.
Method: We propose a dual-metric preference optimization framework that, for the first time, introduces token-level generation probability consistency as a quantifiable, intrinsic coherence criterion, jointly modeled with answer correctness. Our approach integrates explicit (outcome-driven) and implicit (probability-distribution alignment–driven) preference signals, and seamlessly adapts to both RLHF and DPO pipelines.
Contribution/Results: Evaluated on Llama, Qwen, and Phi models across mathematical reasoning benchmarks (including GSM8K, MATH, and AMC), our method consistently outperforms state-of-the-art preference optimization approaches. Empirical results demonstrate that explicitly modeling logical coherence significantly enhances LLMs' mathematical reasoning capabilities, establishing coherence-aware preference optimization as a critical step toward aligning language models with human-like reasoning.
📝 Abstract
Recent advances in preference optimization have demonstrated significant potential for improving mathematical reasoning capabilities in large language models (LLMs). While current approaches leverage high-quality pairwise preference data through outcome-based criteria like answer correctness or consistency, they fundamentally neglect the internal logical coherence of responses. To overcome this, we propose Probability-Consistent Preference Optimization (PCPO), a novel framework that establishes dual quantitative metrics for preference selection: (1) surface-level answer correctness and (2) intrinsic token-level probability consistency across responses. Extensive experiments show that PCPO consistently outperforms approaches that rely on outcome-only criteria across a diverse range of LLMs and benchmarks. Our code is publicly available at https://github.com/YunqiaoYang/PCPO.
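To make the dual-metric selection concrete, here is a minimal illustrative sketch of building a preference pair from both criteria: answer correctness and a token-level probability-consistency score. All function names, the candidate dictionary format, and the specific consistency measure (closeness of length-normalized log-probabilities) are hypothetical stand-ins for exposition, not the actual PCPO implementation.

```python
import math

def avg_logprob(token_probs):
    """Length-normalized log-probability of a response's tokens."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

def consistency(probs_a, probs_b):
    """Hypothetical probability-consistency score: closeness of two
    responses' average log-probabilities (0 = identical confidence
    profile; more negative = less consistent)."""
    return -abs(avg_logprob(probs_a) - avg_logprob(probs_b))

def select_pair(candidates):
    """Form a (chosen, rejected) preference pair using both criteria:
    surface-level correctness AND intrinsic probability consistency."""
    correct = [c for c in candidates if c["correct"]]
    wrong = [c for c in candidates if not c["correct"]]
    if not correct or not wrong:
        return None  # need both outcomes to form a pair
    # Chosen: correct response with the highest model confidence.
    chosen = max(correct, key=lambda c: avg_logprob(c["token_probs"]))
    # Rejected: incorrect response least consistent with the chosen one.
    rejected = min(
        wrong,
        key=lambda c: consistency(c["token_probs"], chosen["token_probs"]),
    )
    return chosen, rejected
```

The resulting pairs could then feed a standard DPO or RLHF pipeline unchanged, since only the pair-construction step differs from outcome-only selection.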