Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses unfaithful reasoning in large language models, where a generated chain-of-thought (CoT) rationale may not reflect the model's internal reasoning process. The authors propose CIE-Scorer, a framework that combines circuit tracing from mechanistic interpretability with external signals from the generated rationale. It traces compact sentence-level circuits from informative keyword tokens, builds an internal computation graph and an external reasoning graph, and scores their structural discrepancy with the Fused Gromov–Wasserstein (FGW) distance, giving an instance-level faithfulness score. The approach reduces the cost of circuit construction compared with tracing full reasoning circuits and achieves state-of-the-art performance on four datasets from FaithCoT-Bench.
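
For reference, the standard Fused Gromov–Wasserstein objective with squared structure loss can be written as below. The symbols are assumptions made for illustration, not notation taken from the paper: $M_{ij}$ is a feature cost between node $i$ of the internal graph and node $j$ of the external graph, $C^{\mathrm{int}}$ and $C^{\mathrm{ext}}$ are the graphs' structure matrices, $p$ and $q$ are node weight distributions, and $\alpha \in [0, 1]$ trades off features against structure. How CIE-Scorer defines these quantities is not stated in this summary.

```latex
\mathrm{FGW}_{\alpha}\!\left(G_{\mathrm{int}}, G_{\mathrm{ext}}\right)
  = \min_{\pi \in \Pi(p, q)}
    \sum_{i,j,k,l}
    \Big[ (1-\alpha)\, M_{ij}
        + \alpha\, \big\lvert C^{\mathrm{int}}_{ik} - C^{\mathrm{ext}}_{jl} \big\rvert^{2}
    \Big]\, \pi_{ij}\, \pi_{kl},
\qquad
\Pi(p, q) = \Big\{ \pi \ge 0 : \textstyle\sum_{j} \pi_{ij} = p_i,\ \sum_{i} \pi_{ij} = q_j \Big\}
```

A small value means the two graphs can be aligned so that matched nodes have similar features and similar connectivity; a large value indicates a mismatch in either.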

📝 Abstract

Chain-of-thought (CoT) reasoning improves the problem-solving ability of large language models (LLMs), but generated reasoning traces may not faithfully reflect the model's actual decision process. Existing CoT unfaithfulness detectors mainly rely on external signals from generated rationales, such as textual plausibility or answer consistency, while overlooking evidence from the model's internal computation. Although recent circuit tracing methods provide a way to obtain model-internal evidence by tracing how information flows through model components during reasoning, constructing full reasoning circuits for long CoTs is costly and difficult to scale. To address these challenges, we propose Circuit-guided Internal-External Discrepancy Scorer (CIE-Scorer), a framework for instance-level CoT unfaithfulness detection. The key idea is that faithful reasoning traces should align with the model's computational process, whereas unfaithful traces may diverge from it. CIE-Scorer efficiently traces compact sentence-level circuits from informative reasoning tokens, constructs internal and external reasoning graphs, and measures their discrepancy using Fused Gromov–Wasserstein distance. Experiments on four datasets from FaithCoT-Bench show that CIE-Scorer achieves state-of-the-art performance while reducing the cost of circuit construction, demonstrating the effectiveness of combining mechanistic interpretability signals with external reasoning traces for CoT unfaithfulness detection.
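
The graph-comparison step can be illustrated with a toy computation. The sketch below is not the authors' implementation: the graphs, keyword labels, 0/1 feature cost, and the function names `fgw_cost` and `fgw_upper_bound` are all hypothetical. It also swaps the usual optimal-transport solver for a brute-force search over one-to-one node matchings with uniform node weights, which only works for tiny graphs of equal size and yields an upper bound on the true Fused Gromov–Wasserstein value.

```python
from itertools import permutations


def fgw_cost(feat_cost, C1, C2, perm, alpha=0.5):
    """FGW objective for the one-to-one matching i -> perm[i], uniform node weights."""
    n = len(perm)
    w = 1.0 / n  # mass carried by each matched pair
    # Feature term: mismatch between the labels of matched nodes.
    feature = w * sum(feat_cost[i][perm[i]] for i in range(n))
    # Structure term: mismatch between the connectivity of matched node pairs.
    structure = w * w * sum(
        (C1[i][k] - C2[perm[i]][perm[k]]) ** 2
        for i in range(n)
        for k in range(n)
    )
    return (1 - alpha) * feature + alpha * structure


def fgw_upper_bound(feat_cost, C1, C2, alpha=0.5):
    """Minimise over all permutations (assumption: tiny graphs of equal size)."""
    n = len(C1)
    return min(
        fgw_cost(feat_cost, C1, C2, perm, alpha)
        for perm in permutations(range(n))
    )


def label_cost(labels_a, labels_b):
    """Hypothetical 0/1 feature cost: 0 when two keyword labels agree, else 1."""
    return [[0.0 if a == b else 1.0 for b in labels_b] for a in labels_a]


# Hypothetical internal graph: a three-step chain price -> discount -> total.
internal_labels = ["price", "discount", "total"]
internal_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]

# External graph read off a faithful rationale: same steps, same links.
faithful_labels = ["price", "discount", "total"]
faithful_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]

# External graph read off an unfaithful rationale: a different middle step,
# and the stated reasoning jumps straight from "price" to "total".
unfaithful_labels = ["price", "tax", "total"]
unfaithful_adj = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]

faithful_score = fgw_upper_bound(
    label_cost(internal_labels, faithful_labels), internal_adj, faithful_adj
)
unfaithful_score = fgw_upper_bound(
    label_cost(internal_labels, unfaithful_labels), internal_adj, unfaithful_adj
)

print("faithful discrepancy:  ", faithful_score)
print("unfaithful discrepancy:", unfaithful_score)
```

In this toy setting the faithful rationale scores zero and the unfaithful one scores strictly higher, which is the behaviour a discrepancy-based detector relies on. For graphs of realistic size, a dedicated optimal-transport solver would replace the permutation search.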

Problem

Research questions and friction points this paper is trying to address.

Chain-of-Thought

unfaithfulness detection

large language models

reasoning traces

internal-external discrepancy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought

Circuit Tracing

Internal-External Discrepancy