🤖 AI Summary
The evaluation of step-by-step reasoning quality in large language models lacks standardized criteria, resulting in fragmented metrics, benchmarks, and inconsistent evaluation practices.
Method: We systematically survey over 60 reasoning evaluation methods and propose a taxonomy for reasoning trace assessment structured along four dimensions: groundedness, validity, coherence, and utility. Through cross-criterion experiments, we empirically assess how well evaluator models transfer across these criteria and identify the boundaries within which such transfer is feasible.
Contribution/Results: We introduce a standardized meta-evaluation benchmark and a structured framework for reasoning assessment that explicitly maps evaluation metrics to the criteria they target. The framework clarifies conceptual relationships among existing methods, enables systematic comparison, and lays the groundwork for rigorous, reproducible reasoning evaluation. This work bridges the gap between theoretical desiderata and practical assessment, advancing the standardization and scientific rigor of LLM reasoning evaluation.
📝 Abstract
Step-by-step reasoning is widely used to enhance the reasoning ability of large language models (LLMs) on complex problems. Evaluating the quality of reasoning traces is crucial for understanding and improving LLM reasoning. However, the evaluation criteria remain highly unstandardized, leading to fragmented efforts in developing metrics and meta-evaluation benchmarks. To address this gap, this survey provides a comprehensive overview of step-by-step reasoning evaluation, proposing a taxonomy of evaluation criteria with four top-level categories (groundedness, validity, coherence, and utility). We then categorize metrics based on their implementations, survey which metrics are used to assess each criterion, and explore whether evaluator models can transfer across different criteria. Finally, we identify key directions for future research.
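To make the four top-level criteria concrete, below is a minimal Python sketch (not taken from the paper) of how metric functions might be organized under the taxonomy and applied to a single reasoning trace. All class, function, and metric names here are illustrative placeholders, and the constant scores stand in for real model- or rule-based metrics.

```python
# Hypothetical sketch (not the survey's code): organizing metric functions
# under the four top-level criteria and scoring one reasoning trace.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ReasoningTrace:
    question: str
    steps: List[str]   # intermediate reasoning steps
    answer: str


MetricFn = Callable[[ReasoningTrace], float]  # returns a score in [0, 1]


# Placeholder metrics; real ones would be model- or rule-based.
def supported_by_context(trace: ReasoningTrace) -> float:
    return 1.0  # e.g., fraction of steps entailed by the question/context


def steps_logically_follow(trace: ReasoningTrace) -> float:
    return 1.0  # e.g., fraction of steps derivable from earlier steps


def steps_mutually_consistent(trace: ReasoningTrace) -> float:
    return 1.0  # e.g., 1 - rate of contradictions between steps


def steps_help_reach_answer(trace: ReasoningTrace) -> float:
    return 1.0  # e.g., fraction of steps that contribute to the final answer


# Illustrative mapping of metrics onto the taxonomy's four criteria.
TAXONOMY: Dict[str, List[MetricFn]] = {
    "groundedness": [supported_by_context],
    "validity": [steps_logically_follow],
    "coherence": [steps_mutually_consistent],
    "utility": [steps_help_reach_answer],
}


def evaluate(trace: ReasoningTrace) -> Dict[str, float]:
    """Average the metric scores registered under each criterion."""
    return {
        criterion: sum(m(trace) for m in metrics) / len(metrics)
        for criterion, metrics in TAXONOMY.items()
    }


if __name__ == "__main__":
    trace = ReasoningTrace(
        question="What is 12 * 7?",
        steps=["10 * 7 = 70", "2 * 7 = 14", "70 + 14 = 84"],
        answer="84",
    )
    print(evaluate(trace))  # one score per criterion
```

The point of the sketch is the explicit metric-to-criterion mapping: keeping that mapping in one place is what lets different metrics for the same criterion be compared, and lets an evaluator model registered under one criterion be tested for transfer to another.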