Rectify Evaluation Preference: Improving LLMs' Critique on Math Reasoning via Perplexity-aware Reinforcement Learning

📅 2025-11-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the “imbalanced evaluation bias” in large language models (LLMs) during multi-step mathematical reasoning (MsMR) assessment—specifically, their overreliance on low-perplexity surface cues, leading to erroneous validation of flawed reasoning paths. We formally define this bias and introduce the One-to-Many Problem-Solution (OPS) benchmark for quantitative analysis. Methodologically, we propose a perplexity-aware reinforcement learning framework that incorporates token-level perplexity as an auxiliary signal for policy optimization, integrated within a Group Relative Policy Optimization (GRPO) paradigm to jointly optimize statistical preference modeling and critical reasoning capability. Evaluated on our curated OPS benchmark and multiple public critique benchmarks, the method achieves significant improvements in judgment accuracy, demonstrating its effectiveness, robustness, and generalizability across diverse reasoning tasks and model families.

📝 Abstract
To improve Multi-step Mathematical Reasoning (MsMR) of Large Language Models (LLMs), it is crucial to obtain scalable supervision from the corpus by automatically critiquing mistakes in the reasoning process and rendering a final verdict on each problem-solution pair. Most existing methods rely on crafting high-quality supervised fine-tuning demonstrations to enhance critiquing capability, and pay little attention to the underlying reason for LLMs' poor critiquing performance. In this paper, we orthogonally quantify and investigate a potential reason, imbalanced evaluation preference, and conduct a statistical preference analysis. Motivated by this analysis, we propose a novel perplexity-aware reinforcement learning algorithm that rectifies the evaluation preference and elevates critiquing capability. Specifically, to probe LLMs' critiquing characteristics, a One-to-many Problem-Solution (OPS) benchmark is meticulously constructed to quantify the behavioral difference of LLMs when evaluating problem solutions generated by themselves versus by others. Then, to investigate this behavioral difference in depth, we conduct a perplexity-oriented statistical preference analysis and find an intriguing phenomenon, "LLMs incline to judge solutions with lower perplexity as correct", which we dub *imbalanced evaluation preference*. To rectify this preference, we regard perplexity as the baton in the Group Relative Policy Optimization algorithm, encouraging LLMs to explore trajectories that judge lower-perplexity solutions as wrong and higher-perplexity solutions as correct. Extensive experimental results on our OPS benchmark and existing critic benchmarks demonstrate the validity of our method.
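The analysis above rests on token-level perplexity of a candidate solution under the model. As a minimal sketch (not the authors' code), perplexity can be computed from per-token log-probabilities, so that lower values indicate text the model finds more predictable:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities (natural log):
    ppl = exp(-mean(log p_t)). Lower perplexity means the model
    finds the sequence more predictable."""
    assert token_logprobs, "need at least one token"
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A fluent (low-surprise) solution vs. a surprising one:
fluent = [-0.1, -0.2, -0.15]   # hypothetical log-probs
odd = [-2.0, -3.5, -1.8]
assert perplexity(fluent) < perplexity(odd)
```

Under the bias identified in the paper, a critic LLM would tend to judge the `fluent` solution as correct regardless of its actual validity.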
Problem

Research questions and friction points this paper is trying to address.

Rectify LLMs' imbalanced evaluation preference in math reasoning critique
Address bias where LLMs favor solutions with lower perplexity scores
Improve automated critiquing of multi-step mathematical reasoning processes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Perplexity-aware reinforcement learning rectifies evaluation preference
Group Relative Policy Optimization guides LLMs' judgment
Algorithm adjusts LLMs' bias toward lower perplexity solutions
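A rough sketch of how a perplexity-aware reward could plug into GRPO-style group-relative advantage estimation; the bonus term, its weight, and the median threshold here are illustrative assumptions, not the paper's exact formulation:

```python
import statistics

def reward(verdict, label, solution_ppl, median_ppl, bonus=0.5):
    """Base reward for a correct verdict, plus an illustrative bonus
    when that verdict goes against the perplexity prior: judging a
    low-perplexity solution wrong, or a high-perplexity one correct."""
    r = 1.0 if verdict == label else 0.0
    against_prior = (verdict == "wrong" and solution_ppl < median_ppl) or \
                    (verdict == "correct" and solution_ppl >= median_ppl)
    if verdict == label and against_prior:
        r += bonus
    return r

def group_relative_advantages(rewards):
    """GRPO normalizes rewards within a sampled group:
    A_i = (r_i - mean) / std."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sd for r in rewards]

rs = [reward("wrong", "wrong", 5.0, 8.0),      # low-ppl judged wrong: rewarded extra
      reward("correct", "wrong", 5.0, 8.0),    # wrong verdict: no reward
      reward("correct", "correct", 9.0, 8.0)]  # high-ppl judged correct: rewarded extra
advs = group_relative_advantages(rs)
```

The shaping makes anti-prior trajectories relatively more attractive within each group, which is the mechanism the paper uses to rectify the imbalanced evaluation preference.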
Changyuan Tian
Aerospace Information Research Institute, Chinese Academy of Sciences
Zhicong Lu
Assistant Professor, George Mason University
HCI, social computing, live streaming, creativity support, intangible cultural heritage
Shuang Qian
Imperial College London
Cardiac digital twins, computational modelling
Nayu Liu
School of Computer Science and Technology, Tiangong University
Peiguang Li
Meituan Group
Natural Language Processing
Li Jin
Aerospace Information Research Institute, Chinese Academy of Sciences
Leiyi Hu
Aerospace Information Research Institute, Chinese Academy of Sciences
Zhizhao Zeng
Meituan
Sirui Wang
Meituan
NLP, LLM
Ke Zeng
Meituan
Zhi Guo
Zhi Guo
Aerospace Information Research Institute, Chinese Academy of Sciences