🤖 AI Summary
Current evaluations of large language model reasoning predominantly rely on the correctness of final answers, offering limited insight into the reliability of intermediate reasoning steps. This work proposes EvalQReason, a framework for annotation-free, step-level probabilistic analysis of reasoning processes. By integrating the Consecutive Step Divergence (CSD) and Step-to-Final Convergence (SFC) algorithms, the framework jointly captures local coherence and global alignment to quantitatively assess reasoning quality. Experiments on open-source 7B-parameter models demonstrate that CSD features combined with sequential modeling achieve strong performance (F1 = 0.88, ROC-AUC = 0.97). The analysis further reveals distinct discriminative patterns in mathematical reasoning, whereas medical reasoning exhibits weaker signals, highlighting domain-specific characteristics and establishing a scalable paradigm for evaluating reasoning reliability.
📝 Abstract
Large Language Models (LLMs) are increasingly deployed in critical applications requiring reliable reasoning, yet their internal reasoning processes remain difficult to evaluate systematically. Existing methods focus on final-answer correctness, providing limited insight into how reasoning unfolds across intermediate steps. We present EvalQReason, a framework that quantifies LLM reasoning quality through step-level probability distribution analysis without requiring human annotation. The framework introduces two complementary algorithms: Consecutive Step Divergence (CSD), which measures local coherence between adjacent reasoning steps, and Step-to-Final Convergence (SFC), which assesses global alignment with final answers. Each algorithm employs five statistical metrics to capture reasoning dynamics. Experiments across mathematical and medical datasets with open-source 7B-parameter models demonstrate that CSD-based features achieve strong predictive performance for correctness classification, with classical machine learning models reaching F1=0.78 and ROC-AUC=0.82, and sequential neural models substantially improving performance (F1=0.88, ROC-AUC=0.97). CSD consistently outperforms SFC, and sequential architectures outperform classical machine learning approaches. Critically, reasoning dynamics prove domain-specific: mathematical reasoning exhibits clear divergence-based discrimination patterns between correct and incorrect solutions, while medical reasoning shows minimal discriminative signals, revealing fundamental differences in how LLMs process different reasoning types. EvalQReason enables scalable, process-aware evaluation of reasoning reliability, establishing probability-based divergence analysis as a principled approach for trustworthy AI deployment.
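The abstract describes CSD as a divergence between adjacent step-level probability distributions and SFC as a measure of each step's alignment with the final answer. A minimal sketch of that idea is below; the choice of KL divergence and the particular summary statistics (`mean`, `variance`, `max`, `min`, `trend`) are illustrative assumptions, not necessarily the authors' exact metrics.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over the same vocabulary.
    A small eps guards against log(0); other divergences (e.g. JSD) would also fit."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consecutive_step_divergence(step_dists):
    """CSD sketch: divergence between each pair of adjacent reasoning-step
    distributions, measuring local coherence across the reasoning trace."""
    return [kl_divergence(p, q) for p, q in zip(step_dists, step_dists[1:])]

def step_to_final_convergence(step_dists):
    """SFC sketch: divergence of each intermediate step's distribution from the
    final step's distribution, measuring global alignment with the answer."""
    final = step_dists[-1]
    return [kl_divergence(p, final) for p in step_dists[:-1]]

def summary_features(divergences):
    """Hypothetical statistical metrics summarizing a divergence trajectory;
    the paper uses five metrics per algorithm, names here are placeholders."""
    n = len(divergences)
    mean = sum(divergences) / n
    var = sum((d - mean) ** 2 for d in divergences) / n
    return {
        "mean": mean,
        "variance": var,
        "max": max(divergences),
        "min": min(divergences),
        "trend": divergences[-1] - divergences[0],  # crude drift over the trace
    }
```

These per-step features could then feed either a classical classifier or a sequential model for correctness prediction, matching the two model families compared in the experiments.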