🤖 AI Summary
This work addresses a limitation of existing hallucination detection methods: they often rely on superficial answer cues and do not assess the reasoning itself. To expose this, the authors propose a controlled-invariance analysis framework built on two oracle perturbations: Force, which replaces each response's final answer with the ground truth while keeping the reasoning trace, and Remove, which strips answer-announcement statements from the trace. Comparing a detector's behavior before and after these perturbations reveals answer-level spurious correlations. Building on this framework, they design TRACT, a lightweight scoring model that uses lexical trajectory features: hedging patterns, step-length dynamics, and cross-response lexical convergence. TRACT remains robust after answer-related spurious signals are eliminated, and on the original, unperturbed reasoning trajectories it matches or outperforms current baselines.
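To make the three feature families concrete, here is a minimal Python sketch of how such lexical trajectory features could be computed. It is not TRACT's implementation: the hedge word list, the tokenizer, the use of a least-squares slope as the "trend", and the Jaccard overlap used for convergence are all illustrative assumptions, and every function name is hypothetical.

```python
import re
from itertools import combinations

# Illustrative hedge lexicon (an assumption; the paper's lexicon is not given here).
HEDGES = {"maybe", "perhaps", "possibly", "might", "probably", "unsure"}

def tokens(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z']+", text.lower())

def slope(values):
    """Least-squares slope of values against their index (0.0 for fewer than 2 points)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def hedging_trend(steps):
    """Trend of the per-step hedge-word rate along one reasoning trace."""
    rates = []
    for step in steps:
        toks = tokens(step)
        rates.append(sum(t in HEDGES for t in toks) / max(len(toks), 1))
    return slope(rates)

def step_length_trend(steps):
    """Trend of step length (in tokens) along one reasoning trace."""
    return slope([len(tokens(step)) for step in steps])

def vocab_convergence(responses):
    """Mean pairwise Jaccard overlap between the vocabularies of sampled responses."""
    vocabs = [set(tokens(" ".join(steps))) for steps in responses]
    pairs = list(combinations(vocabs, 2))
    if not pairs:
        return 0.0
    return sum(len(a & b) / max(len(a | b), 1) for a, b in pairs) / len(pairs)
```

Each function takes a trace as a list of step strings; `vocab_convergence` takes a list of such traces sampled for the same question. None of the three reads the final answer field, which is the property that would let a scorer of this kind survive answer-level perturbations.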
📝 Abstract
Hallucination detection methods for large language models increasingly operate on chain-of-thought reasoning traces, yet it remains unclear whether they evaluate the reasoning itself or merely exploit surface correlates of the final answer. We introduce a controlled-invariance methodology that exposes this distinction through two oracle tests: \textsc{Force}, which replaces each response's final answer with the ground truth while preserving the reasoning trace, and \textsc{Remove}, which strips answer-announcement steps while leaving the rest of the trajectory intact. Together, these tests reveal whether a detector's predictive power derives from answer-level artifacts rather than from the structure or validity of the intermediate reasoning. We further show that once these artifacts are controlled for, effective detection does not necessarily require complex learned representations: TRACT, a lightweight scorer built on lexical trajectory features (hedging trends, step-length dynamics, and cross-response vocabulary convergence), remains robust under both perturbations while matching or outperforming existing baselines on unperturbed traces. These findings suggest that the central challenge in reasoning-aware hallucination detection today is not the absence of signal in the trace, but the failure to isolate that signal from endpoint cues.
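The two oracle tests can be sketched as simple transformations of a trace. The Python below is a minimal illustration under stated assumptions, not the authors' code: the `Trace` structure, the `ANSWER_STEP` pattern for detecting answer-announcement steps, and the `mean_score_shift` helper are all hypothetical. It also assumes that Force changes only the final answer field and that Remove leaves that field untouched.

```python
import re
from dataclasses import dataclass, replace

# Hypothetical pattern for answer-announcement steps (an assumption for illustration).
ANSWER_STEP = re.compile(r"\b(final answer|the answer is)\b", re.IGNORECASE)

@dataclass(frozen=True)
class Trace:
    steps: tuple[str, ...]  # intermediate reasoning steps, in order
    answer: str             # final answer emitted by the model

def force(trace: Trace, ground_truth: str) -> Trace:
    """Force: swap in the ground-truth answer, leave the reasoning steps as they are."""
    return replace(trace, answer=ground_truth)

def remove(trace: Trace) -> Trace:
    """Remove: drop steps that announce the answer, keep every other step in order."""
    kept = tuple(s for s in trace.steps if not ANSWER_STEP.search(s))
    return replace(trace, steps=kept)

def mean_score_shift(detector, traces, perturb) -> float:
    """Average absolute change in a detector's score after a perturbation.

    `detector` is any callable mapping a Trace to a float; a large shift
    suggests the score depends on what the perturbation changed.
    """
    shifts = [abs(detector(perturb(t)) - detector(t)) for t in traces]
    return sum(shifts) / len(shifts) if shifts else 0.0
```

For example, `mean_score_shift(detector, traces, remove)` near zero would indicate that the detector's scores do not hinge on answer-announcement steps, whereas `mean_score_shift(detector, pairs_free_traces, lambda t: force(t, "4"))`-style checks with the true answer per question probe dependence on the endpoint itself.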