🤖 AI Summary
Current evaluations of large language model reasoning predominantly rely on the correctness of final answers, offering limited insight into the reliability of intermediate reasoning steps. This work proposes EvalQReason, a framework for annotation-free, step-level probabilistic analysis of reasoning processes. By integrating the Consecutive Step Divergence (CSD) and Step-to-Final Convergence (SFC) algorithms, the framework jointly captures local coherence and global alignment to quantitatively assess reasoning quality. Experiments on open-source 7B-parameter models demonstrate that CSD features combined with sequential modeling achieve strong performance (F1 = 0.88, ROC-AUC = 0.97). The analysis further reveals distinct discriminative patterns in mathematical reasoning, whereas medical reasoning exhibits weaker signals, highlighting domain-specific characteristics and establishing a scalable paradigm for evaluating reasoning reliability.
📝 Abstract
Large Language Models (LLMs) are increasingly deployed in critical applications requiring reliable reasoning, yet their internal reasoning processes remain difficult to evaluate systematically. Existing methods focus on final-answer correctness, providing limited insight into how reasoning unfolds across intermediate steps. We present EvalQReason, a framework that quantifies LLM reasoning quality through step-level probability distribution analysis without requiring human annotation. The framework introduces two complementary algorithms: Consecutive Step Divergence (CSD), which measures local coherence between adjacent reasoning steps, and Step-to-Final Convergence (SFC), which assesses global alignment with final answers. Each algorithm employs five statistical metrics to capture reasoning dynamics. Experiments across mathematical and medical datasets with open-source 7B-parameter models demonstrate that CSD-based features achieve strong predictive performance for correctness classification, with classical machine learning models reaching F1=0.78 and ROC-AUC=0.82, and sequential neural models substantially improving performance (F1=0.88, ROC-AUC=0.97). CSD consistently outperforms SFC, and sequential architectures outperform classical machine learning approaches. Critically, reasoning dynamics prove domain-specific: mathematical reasoning exhibits clear divergence-based discrimination patterns between correct and incorrect solutions, while medical reasoning shows minimal discriminative signals, revealing fundamental differences in how LLMs process different reasoning types. EvalQReason enables scalable, process-aware evaluation of reasoning reliability, establishing probability-based divergence analysis as a principled approach for trustworthy AI deployment.
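The abstract describes CSD as a divergence between adjacent step-level probability distributions and SFC as a measure of each step's alignment with the final answer. A minimal sketch of that idea is below; the choice of KL divergence and the particular summary statistics (`mean`, `variance`, `max`, `min`, `trend`) are illustrative assumptions, not necessarily the authors' exact metrics.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over the same vocabulary.
    A small eps guards against log(0); other divergences (e.g. JSD) would also fit."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consecutive_step_divergence(step_dists):
    """CSD sketch: divergence between each pair of adjacent reasoning-step
    distributions, measuring local coherence across the reasoning trace."""
    return [kl_divergence(p, q) for p, q in zip(step_dists, step_dists[1:])]

def step_to_final_convergence(step_dists):
    """SFC sketch: divergence of each intermediate step's distribution from the
    final step's distribution, measuring global alignment with the answer."""
    final = step_dists[-1]
    return [kl_divergence(p, final) for p in step_dists[:-1]]

def summary_features(divergences):
    """Hypothetical statistical metrics summarizing a divergence trajectory;
    the paper uses five metrics per algorithm, names here are placeholders."""
    n = len(divergences)
    mean = sum(divergences) / n
    var = sum((d - mean) ** 2 for d in divergences) / n
    return {
        "mean": mean,
        "variance": var,
        "max": max(divergences),
        "min": min(divergences),
        "trend": divergences[-1] - divergences[0],  # crude drift over the trace
    }
```

These per-step features could then feed either a classical classifier or a sequential model for correctness prediction, matching the two model families compared in the experiments.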