AI Summary
This work investigates whether the intermediate structures generated by large language models (LLMs) during schema-guided reasoning causally determine their final outputs or merely serve as epiphenomena. To address this, the authors propose the first evaluation protocol based on quantifiable causal interventions: by applying controlled perturbations to intermediate structures and measuring the resulting changes in model outputs, they assess the models' faithfulness to these structures. Experiments across eight mainstream LLMs and three benchmark tasks reveal that in up to 60% of cases, model outputs remain unchanged despite corrections to the intermediate structures, exposing significant causal fragility. Further analysis demonstrates that integrating external tools substantially enhances causal faithfulness, whereas prompt engineering yields only marginal improvements.
Abstract
Schema-guided reasoning pipelines ask LLMs to produce explicit intermediate structures -- rubrics, checklists, verification queries -- before committing to a final decision. But do these structures causally determine the output, or merely accompany it? We introduce a causal evaluation protocol that makes this directly measurable: by selecting tasks where a deterministic function maps intermediate structures to decisions, every controlled edit to a structure implies a unique correct output. Across eight models and three benchmarks, models appear self-consistent with their own intermediate structures but fail to update predictions after intervention in up to 60% of cases -- revealing that apparent faithfulness is fragile once the intermediate structure changes. When derivation of the final decision from the structure is delegated to an external tool, this fragility largely disappears; however, prompts that ask the model to prioritize the intermediate structure over the original input do not materially close the gap. Overall, intermediate structures in schema-guided pipelines function as influential context rather than stable causal mediators.
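The intervention logic described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual harness: it assumes the intermediate structure is a boolean checklist, the deterministic decision rule is "accept iff all items pass", and `model` is a stand-in callable.

```python
def decision_rule(checklist):
    """Deterministic map from intermediate structure to decision
    (assumed rule: accept iff every checklist item passes)."""
    return "accept" if all(checklist.values()) else "reject"

def intervene(checklist, item):
    """Controlled edit: flip one checklist item."""
    edited = dict(checklist)
    edited[item] = not edited[item]
    return edited

def is_faithful(model, task_input, checklist, item):
    """The edited checklist implies a unique correct output via the
    decision rule; the model is causally faithful on this case iff
    its answer under the edited structure matches that implied output."""
    edited = intervene(checklist, item)
    implied = decision_rule(edited)
    observed = model(task_input, edited)
    return observed == implied

# Stub model that ignores the intermediate structure entirely
# (i.e., the structure is epiphenomenal for this model):
sticky_model = lambda task_input, checklist: "accept"

checklist = {"cites_sources": True, "no_contradiction": True}
# Flipping an item to False implies "reject", but the sticky model
# still answers "accept" -> unfaithful under this intervention.
print(is_faithful(sticky_model, "doc-1", checklist, "no_contradiction"))  # False
```

Delegating `decision_rule` to an external tool, rather than asking the model to re-derive the decision itself, corresponds to the tool-augmented setting the abstract reports as largely closing the gap.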