Probability-Consistent Preference Optimization for Enhanced LLM Reasoning

📅 2025-05-29
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Problem: Existing preference optimization methods rely solely on explicit outcome-based metrics, such as answer correctness, and neglect the internal logical coherence of model responses. Method: We propose a dual-metric preference optimization framework that, for the first time, introduces token-level generation probability consistency as a quantifiable, intrinsic coherence criterion, modeled jointly with answer correctness. The approach integrates explicit (outcome-driven) and implicit (probability-distribution-alignment-driven) preference signals and adapts seamlessly to both RLHF and DPO pipelines. Contribution/Results: Evaluated on Llama, Qwen, and Phi models across mathematical reasoning benchmarks, including GSM8K, MATH, and AMC, the method consistently outperforms state-of-the-art preference optimization approaches. The empirical results show that explicitly modeling logical coherence significantly improves LLMs' mathematical reasoning, establishing coherence-aware preference optimization as a meaningful advance in aligning language models with human-like reasoning.
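The page gives no implementation details, but a minimal sketch of how dual-metric pair selection could look is shown below, assuming per-token log-probabilities are available from the generating model. All names here (Candidate, consistency_score, select_pair) are hypothetical, and the consistency proxy (closeness of a response's mean token log-probability to the cohort mean) is an assumption standing in for the paper's actual cross-response metric; see the authors' repository for the real implementation.

```python
# Hypothetical sketch of dual-metric preference pair selection:
# correctness (explicit, outcome-driven) plus a probability-consistency
# proxy (implicit, distribution-driven). Not the authors' code.
from dataclasses import dataclass
from statistics import mean


@dataclass
class Candidate:
    answer: str                  # final extracted answer string
    token_logprobs: list[float]  # per-token log-probs from the sampling model


def consistency_score(cand: Candidate, others: list[Candidate]) -> float:
    """Proxy for cross-response probability consistency: how close this
    response's mean token log-probability sits to the cohort mean.
    Higher (less negative) means more consistent with the group."""
    own = mean(cand.token_logprobs)
    cohort = mean(mean(o.token_logprobs) for o in others)
    return -abs(own - cohort)


def select_pair(cands: list[Candidate], gold: str):
    """Pick a (chosen, rejected) pair jointly on correctness and consistency."""
    scored = [
        (c.answer == gold,
         consistency_score(c, [o for o in cands if o is not c]),
         c)
        for c in cands
    ]
    correct = [s for s in scored if s[0]]
    wrong = [s for s in scored if not s[0]]
    if not correct or not wrong:
        return None  # need both outcomes to form a preference pair
    chosen = max(correct, key=lambda s: s[1])[2]   # correct, most consistent
    rejected = min(wrong, key=lambda s: s[1])[2]   # wrong, least consistent
    return chosen, rejected


# Toy usage with made-up log-probs:
cands = [
    Candidate("42", [-0.2, -0.1, -0.3]),
    Candidate("42", [-1.5, -2.0, -1.0]),
    Candidate("41", [-0.9, -0.8, -1.1]),
]
pair = select_pair(cands, gold="42")
```

Only questions that yield both a correct and an incorrect sample produce a training pair here, mirroring the outcome-plus-coherence selection described in the summary.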

📝 Abstract
Recent advances in preference optimization have demonstrated significant potential for improving mathematical reasoning capabilities in large language models (LLMs). While current approaches leverage high-quality pairwise preference data through outcome-based criteria like answer correctness or consistency, they fundamentally neglect the internal logical coherence of responses. To overcome this, we propose Probability-Consistent Preference Optimization (PCPO), a novel framework that establishes dual quantitative metrics for preference selection: (1) surface-level answer correctness and (2) intrinsic token-level probability consistency across responses. Extensive experiments show that our PCPO consistently outperforms existing outcome-only criterion approaches across a diverse range of LLMs and benchmarks. Our code is publicly available at https://github.com/YunqiaoYang/PCPO.
Problem

Research questions and friction points this paper is trying to address.

Improving mathematical reasoning in LLMs via preference optimization
Addressing the neglect of internal logical coherence in current approaches
Introducing dual metrics for answer correctness and probability consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Probability-Consistent Preference Optimization (PCPO) framework
Dual metrics for preference selection: answer correctness plus token-level probability consistency
Token-level probability consistency evaluation, compatible with standard DPO training (see the sketch after this list)
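Because the summary states that PCPO adapts to standard DPO pipelines, the sketch below shows the standard DPO objective that would consume the selected (chosen, rejected) pairs. This is the well-known DPO loss rather than code from the PCPO repository; the PCPO-specific contribution lies in how the pairs are chosen, not in this loss.

```python
# Standard DPO loss over PCPO-selected pairs (well-known objective,
# not the authors' implementation). Each tensor holds summed sequence
# log-probabilities, shape (batch,).
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards are log-ratios between policy and frozen reference.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

Under this framing, swapping PCPO into an existing DPO setup only changes the data-construction step; the optimization loop itself is untouched, which is consistent with the summary's claim of seamless adaptation.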
👥 Authors
Yunqiao Yang
City University of Hong Kong
Transfer Learning, Machine Learning

Houxing Ren
Beihang University

Zimu Lu
Ph.D. student at the Chinese University of Hong Kong
AI Reasoning, Large Language Model

Ke Wang
CUHK MMLab

Weikang Shi
CUHK MMLab

Aojun Zhou
The Chinese University of Hong Kong
Deep Learning

Junting Pan
CUHK MMLab, CPII under InnoHK

Mingjie Zhan
SenseTime Research

Hongsheng Li
CUHK MMLab, CPII under InnoHK, Shanghai AI Laboratory