Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical limitation in existing mathematical reasoning benchmarks, which predominantly rely on templated computations and fail to assess deep structured reasoning capabilities—such as multi-constraint coordination, constructive logical synthesis, and spatial inference—in large language models. To bridge this gap, the authors introduce ReasoningMath-Plus, a benchmark comprising 150 carefully crafted problems that emphasize reasoning under interacting constraints, constructive solution strategies, and non-trivial structural insights. They further propose a novel process-aware evaluation framework featuring human-annotated reasoning skeletons, a Hazard-aware Chain-based Rule Score (HCRS) metric, and a Process Reward Model (PRM) trained on reasoning trajectories. Experimental results reveal that while leading models achieve answer accuracy of up to 5.8/10, their HCRS scores are substantially lower (averaging 4.36/10), indicating that answer correctness significantly overestimates genuine reasoning proficiency.

📝 Abstract
Recent large language models (LLMs) achieve near-saturation accuracy on many established mathematical reasoning benchmarks, raising concerns about the ability of such benchmarks to diagnose genuine reasoning competence. This saturation largely stems from the dominance of template-based computation and shallow arithmetic decomposition in existing datasets, which underrepresent reasoning skills such as multi-constraint coordination, constructive logical synthesis, and spatial inference. To address this gap, we introduce ReasoningMath-Plus, a benchmark of 150 carefully curated problems explicitly designed to evaluate structural reasoning. Each problem emphasizes reasoning under interacting constraints, constructive solution formation, or non-trivial structural insight, and is annotated with a minimal reasoning skeleton to support fine-grained process-level evaluation. Alongside the dataset, we introduce HCRS (Hazard-aware Chain-based Rule Score), a deterministic step-level scoring function, and train a Process Reward Model (PRM) on the annotated reasoning traces. Empirically, while leading models attain relatively high final-answer accuracy (up to 5.8/10), HCRS-based holistic evaluation yields substantially lower scores (average 4.36/10, best 5.14/10), showing that answer-only metrics can overestimate reasoning robustness.
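The page does not reproduce the HCRS formula itself. As an illustration only, a chain-based step score with hazard-aware discounting might behave like the following sketch, where each reasoning step is checked against the annotated skeleton and an erroneous step discounts all downstream credit (the function name, discount factor, and 0–10 normalization are assumptions, not the paper's definition):

```python
def chain_rule_score(steps_correct, hazard_discount=0.5):
    """Illustrative chain-based rule score over a reasoning trace.

    steps_correct: list of booleans, one per step, indicating whether
    that step matches the reference reasoning skeleton.
    Each correct step earns the current chain weight; each error
    ("hazard") multiplies the weight for all later steps, so mistakes
    early in the chain are penalized more than late ones.
    """
    score, weight = 0.0, 1.0
    for ok in steps_correct:
        if ok:
            score += weight
        else:
            weight *= hazard_discount  # error propagates down the chain
    return 10.0 * score / len(steps_correct)  # normalize to a 0-10 scale

# A fully correct 4-step trace scores 10.0; an early error caps the
# credit obtainable by every subsequent step.
print(chain_rule_score([True, True, True, True]))   # → 10.0
print(chain_rule_score([True, False, True, True]))  # → 5.0
```

Under this toy scoring rule, a model can reach the right answer with a flawed chain and still score well below a clean derivation, which mirrors the gap the paper reports between answer accuracy and HCRS.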
Problem

Research questions and friction points this paper is trying to address.

structural mathematical reasoning
reasoning benchmark
large language models
multi-constraint coordination
constructive logical synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

structural reasoning
process-aware evaluation
reasoning skeleton
HCRS
Process Reward Model
Xiang Zheng
Department of Computer Science, City University of Hong Kong
Reinforcement Learning · Trustworthy AI · Embodied AI
Weiqi Zhai
Alibaba Group
Wei Wang
Tongyi Lab, Alibaba Group
Generative Models
Boyu Yang
Alibaba Group
Wenbo Li
The Chinese University of Hong Kong
Computer Vision · Deep Learning
Ruixiang Luo
Alibaba Group
Haoxiang Sun
Alibaba Group, Shanghai Jiao Tong University
Yucheng Wang
ETH Zürich
Multimodal LLM · Speech Understanding · Human-Computer Interaction
Zhengze Li
Alibaba Group
Meng Wang
Alibaba Group
Yuetian Du
Alibaba Group
Guojie Lin
Alibaba Group
Yaxuan Wang
PhD Student of Computer Science, University of California, Santa Cruz
Machine Learning
Xiaoxiao Xu
Alibaba Group
Yanhu Mo
Alibaba Group
Xuan Ren
Alibaba Group
Hu Wei
Alibaba Group
Ze Xu
Alibaba Group