Misaligning Reasoning with Answers -- A Framework for Assessing LLM CoT Robustness

📅 2025-05-23
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Chain-of-thought (CoT) reasoning in large language models (LLMs) is not robust to input perturbations, which can produce inconsistencies between final answers and the reasoning steps behind them; this severely hinders trustworthy deployment in safety-critical domains. Method: We propose MATCHA, the first evaluation framework that quantifies answer-reasoning alignment to systematically assess the robustness of CoT consistency. It integrates input-perturbation generation, LLM-based judging, consistency-deviation analysis, and cross-model transfer testing. Contribution/Results: MATCHA reveals that multi-step and commonsense reasoning tasks are significantly less robust than logical reasoning tasks, and it empirically validates nontrivial transferability of reasoning fragility across diverse LLMs. As the first diagnostic benchmark targeting answer-reasoning consistency, MATCHA uncovers structural weaknesses in current LLM reasoning and provides a foundation for designing and validating robust reasoning architectures.
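The pipeline the summary describes (perturb the input, collect answer and reasoning, have a judge check alignment, measure deviation) can be sketched as follows. This is a minimal illustration with stub functions, not the paper's implementation; `perturb`, `model`, and `judge_alignment` are hypothetical placeholders for the framework's perturbation generator, the LLM under test, and the LLM judge.

```python
import random
from dataclasses import dataclass


@dataclass
class CoTOutput:
    """A model response split into final answer and CoT reasoning."""
    answer: str
    reasoning: str


def perturb(question: str, rng: random.Random) -> str:
    """Toy character-level perturbation (case flip) standing in for the
    framework's input-perturbation generator."""
    chars = list(question)
    i = rng.randrange(len(chars))
    chars[i] = chars[i].swapcase()
    return "".join(chars)


def model(question: str) -> CoTOutput:
    """Stub for the LLM under test; a real harness would query a model."""
    ans = "4" if "2+2" in question else "unknown"
    return CoTOutput(answer=ans, reasoning=f"Derived step by step from: {question}")


def judge_alignment(out: CoTOutput) -> bool:
    """Stub LLM judge: does the reasoning actually support the answer?
    A real judge would be another LLM prompted to rate alignment."""
    return out.answer != "unknown"


def consistency_rate(question: str, n: int = 20, seed: int = 0) -> float:
    """Fraction of perturbed inputs whose answer matches the clean run
    AND whose reasoning is judged consistent with that answer."""
    rng = random.Random(seed)
    base = model(question)
    aligned = sum(
        1
        for _ in range(n)
        if (out := model(perturb(question, rng))).answer == base.answer
        and judge_alignment(out)
    )
    return aligned / n


rate = consistency_rate("What is 2+2?")
print(f"answer-reasoning consistency under perturbation: {rate:.2f}")
```

A real harness would replace the stubs with API calls and aggregate this rate per task family (multi-step, commonsense, logical) to reproduce the kind of per-category comparison the summary reports.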

๐Ÿ“ Abstract
LLMs' decision-making process is opaque, prompting the need for explanation techniques like Chain-of-Thought. To investigate the relationship between answer and reasoning, we design a novel evaluation framework, MATCHA. In domains like education and healthcare, reasoning is key for model trustworthiness. MATCHA reveals that LLMs under input perturbations can give inconsistent or nonsensical reasoning. Additionally, we use LLM judges to assess reasoning robustness across models. Our results show that LLMs exhibit greater vulnerability to input perturbations for multi-step and commonsense tasks than for logical tasks. We also show non-trivial transfer rates of our successful examples to black-box models. Our evaluation framework helps to better understand LLM reasoning mechanisms and guides future models toward more robust, reasoning-driven architectures that enforce answer-reasoning consistency.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLM reasoning robustness under input perturbations
Investigating answer-reasoning consistency in Chain-of-Thought outputs
Evaluating vulnerability of LLMs in multi-step and commonsense tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

MATCHA framework evaluates reasoning robustness
LLM judges assess reasoning across models
Enforces answer-reasoning consistency in LLMs
Enyi Jiang
University of Illinois at Urbana-Champaign
Changming Xu
University of Illinois at Urbana-Champaign
Trustworthy Machine Learning
Nischay Singh
University of Illinois at Urbana-Champaign
Gagandeep Singh
University of Illinois at Urbana-Champaign