EvalQReason: A Framework for Step-Level Reasoning Evaluation in Large Language Models

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluations of large language model reasoning predominantly rely on the correctness of final answers, offering limited insight into the reliability of intermediate reasoning steps. This work proposes EvalQReason, a framework that introduces an annotation-free, step-level probabilistic analysis of reasoning processes. By integrating the Consecutive Step Divergence (CSD) and Step-to-Final Convergence (SFC) algorithms, the framework jointly captures local coherence and global alignment to quantitatively assess reasoning quality. Experiments on 7B-scale open-source models demonstrate that CSD features combined with sequential modeling achieve strong performance (F1 = 0.88, ROC-AUC = 0.97). The analysis further reveals distinct discriminative patterns in mathematical reasoning, while medical reasoning exhibits weaker signals, highlighting domain-specific characteristics and establishing a scalable new paradigm for evaluating reasoning reliability.
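The summary describes CSD as measuring local coherence between adjacent reasoning steps through their probability distributions, summarized by five statistical metrics. A minimal sketch of what such a computation could look like, assuming KL divergence between adjacent step-level distributions and an illustrative choice of five summary statistics (the paper's exact divergence measure and metric set are not specified here):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def csd_features(step_distributions):
    """Consecutive Step Divergence (sketch): divergence between each pair of
    adjacent step-level distributions, reduced to five summary statistics.
    The statistic choices here are illustrative assumptions."""
    divergences = [
        kl_divergence(p, q)
        for p, q in zip(step_distributions, step_distributions[1:])
    ]
    n = len(divergences)
    mean = sum(divergences) / n
    var = sum((d - mean) ** 2 for d in divergences) / n
    return {
        "mean": mean,
        "std": var ** 0.5,
        "max": max(divergences),
        "min": min(divergences),
        "range": max(divergences) - min(divergences),
    }

# Toy example: three reasoning steps, each a distribution over 4 answer options.
steps = [
    [0.25, 0.25, 0.25, 0.25],  # step 1: uncertain
    [0.40, 0.30, 0.20, 0.10],  # step 2: leaning toward option 0
    [0.70, 0.15, 0.10, 0.05],  # step 3: converging on option 0
]
features = csd_features(steps)
```

Feature vectors of this kind could then feed a downstream correctness classifier, as the experiments describe.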

📝 Abstract
Large Language Models (LLMs) are increasingly deployed in critical applications requiring reliable reasoning, yet their internal reasoning processes remain difficult to evaluate systematically. Existing methods focus on final-answer correctness, providing limited insight into how reasoning unfolds across intermediate steps. We present EvalQReason, a framework that quantifies LLM reasoning quality through step-level probability distribution analysis without requiring human annotation. The framework introduces two complementary algorithms: Consecutive Step Divergence (CSD), which measures local coherence between adjacent reasoning steps, and Step-to-Final Convergence (SFC), which assesses global alignment with final answers. Each algorithm employs five statistical metrics to capture reasoning dynamics. Experiments across mathematical and medical datasets with open-source 7B-parameter models demonstrate that CSD-based features achieve strong predictive performance for correctness classification, with classical machine learning models reaching F1=0.78 and ROC-AUC=0.82, and sequential neural models substantially improving performance (F1=0.88, ROC-AUC=0.97). CSD consistently outperforms SFC, and sequential architectures outperform classical machine learning approaches. Critically, reasoning dynamics prove domain-specific: mathematical reasoning exhibits clear divergence-based discrimination patterns between correct and incorrect solutions, while medical reasoning shows minimal discriminative signals, revealing fundamental differences in how LLMs process different reasoning types. EvalQReason enables scalable, process-aware evaluation of reasoning reliability, establishing probability-based divergence analysis as a principled approach for trustworthy AI deployment.
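The abstract's second algorithm, SFC, assesses how intermediate steps align globally with the final answer. A hedged sketch of one way to operationalize this, assuming convergence is tracked as the KL divergence of each intermediate step's distribution from the final step's (the actual measure, step segmentation, and normalization are assumptions, not the paper's stated method):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def sfc_trajectory(step_distributions):
    """Step-to-Final Convergence (sketch): divergence of each intermediate
    step's distribution from the final step's. A shrinking trajectory
    suggests the reasoning chain converges toward its eventual answer."""
    final = step_distributions[-1]
    return [kl_divergence(p, final) for p in step_distributions[:-1]]

# Toy example: three reasoning steps over 4 answer options.
steps = [
    [0.25, 0.25, 0.25, 0.25],
    [0.40, 0.30, 0.20, 0.10],
    [0.70, 0.15, 0.10, 0.05],
]
traj = sfc_trajectory(steps)
# A monotone non-increasing trajectory indicates step-to-final convergence.
is_converging = all(a >= b for a, b in zip(traj, traj[1:]))
```

Summary statistics over this trajectory would play the same role as the CSD features; per the reported results, CSD-based features proved the stronger predictor.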
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
reasoning evaluation
step-level reasoning
reasoning reliability
probability distribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

step-level reasoning evaluation
probability distribution analysis
Consecutive Step Divergence
reasoning dynamics
trustworthy AI
Shaima Ahmad Freja
Department of Electrical Engineering and Computer Science, University of Stavanger, 4021 Stavanger, Norway
Ferhat Ozgur Catak
Assoc. Professor of Cyber Security at University of Stavanger
Trustworthy AI · Cyber Security · 5G/6G · Data Privacy · Cryptography
Betul Yurdem
Department of Electrical and Electronics Engineering, Izmir Bakircay University, 35665 Izmir, Turkiye
Chunming Rong
University of Stavanger
Security & Blockchain · AI · Cloud Computing