Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability

📅 2026-02-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitation of current Chain-of-Thought (CoT) evaluation, which overly relies on task accuracy and fails to capture the intrinsic quality of the reasoning process itself. To overcome this, the authors propose reusability and verifiability as novel evaluation dimensions and introduce a Thinker-Executor framework that decouples reasoning generation from execution. Leveraging a multi-agent information retrieval architecture, they systematically evaluate four classes of Thinker models and ten Executor variants across five benchmarks. Experimental results reveal that the proposed metrics exhibit no significant correlation with conventional accuracy measures, and specialized reasoning models do not consistently outperform general-purpose large language models—such as Llama and Gemma—in CoT quality. These findings highlight critical blind spots in existing CoT evaluation paradigms.

📝 Abstract
In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning with each other in the form of Chain-of-Thought (CoT) traces. Current CoT evaluation narrowly focuses on target task accuracy. However, this metric fails to assess the quality or utility of the reasoning process itself. To address this limitation, we introduce two novel measures: reusability and verifiability. We decouple CoT generation from execution using a Thinker-Executor framework. Reusability measures how easily an Executor can reuse the Thinker's CoT. Verifiability measures how frequently an Executor can match the Thinker's answer using the CoT. We evaluated four Thinker models against a committee of ten Executor models across five benchmarks. Our results reveal that reusability and verifiability do not correlate with standard accuracy, exposing a blind spot in current accuracy-based leaderboards for reasoning capability. Surprisingly, we find that CoTs from specialized reasoning models are not consistently more reusable or verifiable than those from general-purpose LLMs like Llama and Gemma.
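The verifiability measure described above can be sketched as a simple agreement rate over the Executor committee. The sketch below is illustrative only, assuming the paper's setup (one Thinker answer, a committee of Executor answers derived from the Thinker's CoT); the function names and the exact-match comparison are assumptions, not the authors' implementation.

```python
def verifiability(thinker_answer, executor_answers):
    """Fraction of Executors whose answer, derived from the Thinker's
    CoT, matches the Thinker's own answer (exact match assumed)."""
    if not executor_answers:
        return 0.0
    matches = sum(ans == thinker_answer for ans in executor_answers)
    return matches / len(executor_answers)

# Example: 7 of the 10 Executors reproduce the Thinker's answer "B"
# from its CoT, giving a verifiability score of 0.7.
score = verifiability("B", ["B"] * 7 + ["A", "C", "D"])
print(score)  # → 0.7
```

A per-benchmark score would then average this quantity over all questions; reusability would require an additional notion of "ease of reuse" (e.g., Executor accuracy when conditioned on the CoT), which the abstract does not specify and is therefore not sketched here.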
Problem

Research questions and friction points this paper is trying to address.

Chain-of-Thought
reasoning evaluation
reusability
verifiability
multi-agent IR
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought
reusability
verifiability
Thinker-Executor framework
multi-agent reasoning
Shashank Aggarwal
Indian Institute of Technology Guwahati, Assam, India
Ram Vikas Mishra
Indian Institute of Technology Guwahati, Assam, India
Amit Awekar
Indian Institute of Technology Guwahati, Assam, India
Data Mining
Natural Language Processing