🤖 AI Summary
Existing evaluators struggle to assess the reasoning quality of large language models across diverse programming tasks, and no dedicated benchmark exists for this purpose. To address this gap, this work introduces CodeRQ-Bench, the first benchmark specifically designed for assessing programming reasoning quality, covering code generation, summarization, and classification tasks. The authors also propose VERA, a two-stage evaluation framework that combines evidence-based verification with ambiguity-aware score calibration. The study identifies five key limitations of current evaluators and distills four design principles to guide future development. Experimental results show that VERA outperforms strong baselines across four datasets, with improvements of up to 0.26 in AUC-ROC and 0.21 in AUPRC. The benchmark and code are publicly released.
📝 Abstract
Large language models (LLMs) increasingly rely on explicit reasoning to solve coding tasks, yet evaluating the quality of this reasoning remains challenging. Existing reasoning evaluators are not designed for coding, and current benchmarks focus primarily on code generation, leaving other coding tasks largely unexplored. We introduce CodeRQ-Bench, the first benchmark for evaluating LLM reasoning quality across three coding task categories: generation, summarization, and classification. Using this benchmark, we analyze 1,069 mismatch cases from existing evaluators, identify five recurring limitations, and derive four design insights for reasoning evaluation in coding tasks. Guided by these insights, we propose VERA, a two-stage evaluator that combines evidence-grounded verification with ambiguity-aware score correction. Experiments on CodeRQ-Bench show that VERA consistently outperforms strong baselines across four datasets, improving AUC-ROC by up to 0.26 and AUPRC by up to 0.21. We release CodeRQ-Bench at https://github.com/MrLYG/CodeRQ-Bench to support future research.
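For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how AUC-ROC and AUPRC can be computed for a reasoning evaluator that outputs one score per reasoning trace. It is not taken from the CodeRQ-Bench code. The label convention (1 for a trace labeled as sound, 0 otherwise), the function names, and the toy data are all assumptions made for illustration, and AUPRC is approximated here by average precision.

```python
from typing import Sequence


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision, a common step-wise estimate of AUPRC.

    Tied scores are ordered arbitrarily by the sort; this sketch does not
    apply any special tie handling.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank traces from highest to lowest evaluator score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_pos


# Hypothetical toy data: 1 = trace labeled as sound, 0 = labeled as flawed.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

toy_auc = auc_roc(toy_labels, toy_scores)
toy_ap = average_precision(toy_labels, toy_scores)
print(f"AUC-ROC: {toy_auc:.3f}, average precision: {toy_ap:.3f}")
```

Under this reading, an improvement of 0.26 in AUC-ROC means the evaluator is substantially more likely to rank a sound trace above a flawed one than the baseline is. AUPRC is the more informative of the two when one class is rare.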