🤖 AI Summary
The evaluation of step-by-step reasoning quality in large language models lacks standardized criteria, resulting in fragmented metrics, benchmarks, and inconsistent evaluation practices.
Method: We systematically survey over 60 reasoning evaluation methods and propose a taxonomy for reasoning trace assessment structured along four dimensions: groundedness, validity, coherence, and utility. Through cross-criterion experiments, we empirically assess how well evaluator models transfer across these criteria and identify the boundaries within which such transfer is feasible.
Contribution/Results: We introduce a standardized meta-evaluation benchmark and a structured framework for reasoning assessment that explicitly maps evaluation metrics to the criteria they target. The framework clarifies conceptual relationships among existing methods, enables systematic comparison, and lays the groundwork for rigorous, reproducible reasoning evaluation. This work bridges the gap between theoretical desiderata and practical assessment, advancing the standardization and scientific rigor of LLM reasoning evaluation.
📝 Abstract
Step-by-step reasoning is widely used to enhance the reasoning ability of large language models (LLMs) on complex problems. Evaluating the quality of reasoning traces is crucial for understanding and improving LLM reasoning. However, the evaluation criteria remain highly unstandardized, leading to fragmented efforts in developing metrics and meta-evaluation benchmarks. To address this gap, this survey provides a comprehensive overview of step-by-step reasoning evaluation, proposing a taxonomy of evaluation criteria with four top-level categories (groundedness, validity, coherence, and utility). We then categorize metrics based on their implementations, survey which metrics are used to assess each criterion, and explore whether evaluator models can transfer across different criteria. Finally, we identify key directions for future research.
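To make the four top-level criteria concrete, below is a minimal Python sketch (not taken from the paper) of how metric functions might be organized under the taxonomy and applied to a single reasoning trace. All class, function, and metric names here are illustrative placeholders, and the constant scores stand in for real model- or rule-based metrics.

```python
# Hypothetical sketch (not the survey's code): organizing metric functions
# under the four top-level criteria and scoring one reasoning trace.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ReasoningTrace:
    question: str
    steps: List[str]   # intermediate reasoning steps
    answer: str


MetricFn = Callable[[ReasoningTrace], float]  # returns a score in [0, 1]


# Placeholder metrics; real ones would be model- or rule-based.
def supported_by_context(trace: ReasoningTrace) -> float:
    return 1.0  # e.g., fraction of steps entailed by the question/context


def steps_logically_follow(trace: ReasoningTrace) -> float:
    return 1.0  # e.g., fraction of steps derivable from earlier steps


def steps_mutually_consistent(trace: ReasoningTrace) -> float:
    return 1.0  # e.g., 1 - rate of contradictions between steps


def steps_help_reach_answer(trace: ReasoningTrace) -> float:
    return 1.0  # e.g., fraction of steps that contribute to the final answer


# Illustrative mapping of metrics onto the taxonomy's four criteria.
TAXONOMY: Dict[str, List[MetricFn]] = {
    "groundedness": [supported_by_context],
    "validity": [steps_logically_follow],
    "coherence": [steps_mutually_consistent],
    "utility": [steps_help_reach_answer],
}


def evaluate(trace: ReasoningTrace) -> Dict[str, float]:
    """Average the metric scores registered under each criterion."""
    return {
        criterion: sum(m(trace) for m in metrics) / len(metrics)
        for criterion, metrics in TAXONOMY.items()
    }


if __name__ == "__main__":
    trace = ReasoningTrace(
        question="What is 12 * 7?",
        steps=["10 * 7 = 70", "2 * 7 = 14", "70 + 14 = 84"],
        answer="84",
    )
    print(evaluate(trace))  # one score per criterion
```

The point of the sketch is the explicit metric-to-criterion mapping: keeping that mapping in one place is what lets different metrics for the same criterion be compared, and lets an evaluator model registered under one criterion be tested for transfer to another.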