🤖 AI Summary
This work addresses the systematic failure of large language models when surface-level cues conflict with implicit feasibility constraints. The authors propose a diagnose-measure-bridge-treat framework to characterize this heuristic-override phenomenon and introduce the Heuristic Override Benchmark (HOB), a 500-instance suite that exposes models' overreliance on superficial cues and a conservative bias. Combining causal behavioral analysis, token-level attribution, parametric probing, and minimal-pair designs, the study shows that 14 mainstream models degrade substantially under constraint-conflicting scenarios, with no model exceeding 75% accuracy under strict evaluation. Targeted interventions help: goal-decomposition prompting yields consistent improvements of 6 to 9 percentage points, and a minimal hint recovers roughly 15 percentage points.
📝 Abstract
Large language models systematically fail when a salient surface cue conflicts with an unstated feasibility constraint. We study this through a diagnose-measure-bridge-treat framework. Causal-behavioral analysis of the "car wash problem" across six models reveals approximately context-independent sigmoid heuristics: the distance cue exerts 8.7 to 38 times more influence than the goal, and token-level attribution shows patterns more consistent with keyword associations than compositional inference. The Heuristic Override Benchmark (HOB), 500 instances spanning 4 heuristic families × 5 constraint families with minimal pairs and explicitness gradients, demonstrates generality across 14 models: under strict evaluation (10/10 correct), no model exceeds 75%, and presence constraints are hardest (44%). A minimal hint (e.g., emphasizing the key object) recovers +15 pp on average, suggesting the failure lies in constraint inference rather than missing knowledge; 12/14 models perform worse when the constraint is removed (up to -39 pp), revealing conservative bias. Parametric probes confirm that the sigmoid pattern generalizes to cost, efficiency, and semantic-similarity heuristics; goal-decomposition prompting recovers +6 to +9 pp by forcing models to enumerate preconditions before answering. Together, these results characterize heuristic override as a systematic reasoning vulnerability and provide a benchmark for measuring progress toward resolving it.
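A context-independent sigmoid heuristic of the kind described above can be sketched as a logistic function of the surface cue, with the goal term carrying a much smaller weight. The weights below are purely illustrative placeholders, not the paper's fitted values; the function name and the "walk vs. drive" framing are assumptions for the sketch.

```python
import math

def heuristic_choice_prob(distance: float, goal_strength: float = 1.0,
                          w_dist: float = -1.8, w_goal: float = 0.1,
                          bias: float = 2.5) -> float:
    """Illustrative sigmoid heuristic: probability of the surface-cue-driven
    answer as a logistic function dominated by the distance cue.
    Weights are made up for illustration, not fitted from the paper's data."""
    z = w_dist * distance + w_goal * goal_strength + bias
    return 1.0 / (1.0 + math.exp(-z))

# Relative cue influence as a weight-magnitude ratio; the paper reports the
# distance cue at 8.7-38x the goal's influence (here: 1.8 / 0.1 = 18x).
influence_ratio = abs(-1.8) / abs(0.1)
```

The key property is that the curve depends almost entirely on the cue value and barely shifts with the goal term, which is what "approximately context-independent" captures.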
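The goal-decomposition intervention forces the model to enumerate preconditions before committing to an answer. A minimal sketch of what such a prompt wrapper could look like is below; the exact wording is hypothetical, not the paper's prompt.

```python
def goal_decomposition_prompt(question: str) -> str:
    """Wrap a question so the model must list and check preconditions
    before answering. Wording is an illustrative stand-in, not the
    paper's actual intervention text."""
    return (
        "Before answering, list every precondition the stated goal "
        "requires (objects, resources, feasibility constraints). "
        "Check each precondition against the scenario. "
        "Only then give your final answer.\n\n"
        f"Question: {question}"
    )
```

The design intuition is that explicit precondition enumeration surfaces the unstated feasibility constraint before the salient surface cue can dominate the answer.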