What Defines Good Reasoning in LLMs? Dissecting Reasoning Steps with Multi-Aspect Evaluation

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM evaluations predominantly rely on final-answer correctness, hindering fine-grained diagnosis of reasoning flaws. To address this, we propose a fine-grained reasoning quality evaluation paradigm that decouples reasoning quality into two orthogonal dimensions: relevance and coherence. We introduce Causal Stepwise Evaluation (CaSE), a causally grounded methodology that mitigates post-hoc bias by assessing each reasoning step in context, without relying on the final answer. CaSE enables scalable, interpretable reasoning analysis and debugging. We construct two expert-annotated benchmarks, MRa-GSM8K and MRa-MATH, and validate their reliability via human evaluation. Furthermore, using CaSE to select high-quality reasoning paths during data filtering yields substantial performance gains: +3.2% and +5.7% absolute accuracy on GSM8K and MATH, respectively. These results empirically demonstrate that fine-grained reasoning assessment delivers tangible benefits for model optimization.

📝 Abstract
Evaluating large language models (LLMs) on final-answer correctness is the dominant paradigm. This approach, however, provides a coarse signal for model improvement and overlooks the quality of the underlying reasoning process. We argue that a more granular evaluation of reasoning offers a more effective path to building robust models. We decompose reasoning quality into two dimensions: relevance and coherence. Relevance measures whether a step is grounded in the problem; coherence measures whether it follows logically from prior steps. To measure these aspects reliably, we introduce causal stepwise evaluation (CaSE). This method assesses each reasoning step using only its preceding context, which avoids hindsight bias. We validate CaSE against human judgments on our new expert-annotated benchmarks, MRa-GSM8K and MRa-MATH. More importantly, we show that curating training data with CaSE-evaluated relevance and coherence directly improves final task performance. Our work provides a scalable framework for analyzing, debugging, and improving LLM reasoning, demonstrating the practical value of moving beyond validity checks.
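The causal stepwise procedure the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: `judge_step` stands in for an LLM judge (here a toy lexical-overlap heuristic), and the filtering threshold is an assumed parameter. The key property it demonstrates is that step i is scored using only the problem and steps 0..i-1, never the final answer.

```python
from dataclasses import dataclass


@dataclass
class StepScore:
    relevance: float   # is the step grounded in the problem statement?
    coherence: float   # does it follow from the preceding steps?


def judge_step(problem: str, prior_steps: list[str], step: str) -> StepScore:
    """Hypothetical judge. In practice this would be an LLM prompted with
    only the problem and the steps so far (the final answer is withheld).
    Here: a toy heuristic using word overlap, purely for illustration."""
    words = set(step.lower().split())
    prob_words = set(problem.lower().split())
    # For the first step there is no prior context; fall back to the problem.
    prior_words = set(" ".join(prior_steps).lower().split()) or prob_words
    return StepScore(
        relevance=len(words & prob_words) / max(len(words), 1),
        coherence=len(words & prior_words) / max(len(words), 1),
    )


def case_evaluate(problem: str, steps: list[str]) -> list[StepScore]:
    """Causal stepwise evaluation: each step is judged in its causal
    context only (problem + preceding steps), avoiding hindsight bias."""
    return [judge_step(problem, steps[:i], s) for i, s in enumerate(steps)]


def filter_paths(problem: str, paths: list[list[str]],
                 threshold: float = 0.5) -> list[list[str]]:
    """Data-curation sketch: keep only reasoning paths whose every step
    clears the threshold on both dimensions (threshold is assumed)."""
    return [
        steps for steps in paths
        if all(s.relevance >= threshold and s.coherence >= threshold
               for s in case_evaluate(problem, steps))
    ]
```

A real deployment would replace `judge_step` with calls to a judge model; the surrounding loop structure, which enforces the causal (answer-blind) context, is the part the method specifies.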
Problem

Research questions and friction points this paper is trying to address.

How to evaluate reasoning steps beyond final-answer correctness
How to measure relevance and coherence in LLM reasoning processes
How to build a scalable framework that improves model reasoning quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal stepwise evaluation avoids hindsight bias
Decomposes reasoning quality into relevance and coherence dimensions
Training data curation improves final task performance