🤖 AI Summary
This study challenges the prevailing assumption that chain-of-thought reasoning in language models requires dense, strictly ordered natural language. Through systematic interventions (deletion, masking, shuffling, and noise injection), the authors evaluate how line-level, word-level, and token-level perturbations affect answer extraction across three models and three benchmarks. Models retain 83% accuracy even when all natural language is removed and reasoning lines are arbitrarily reordered. Masking numeric digits drives accuracy to 0%, whereas masking alphabetic text unexpectedly improves accuracy by 4.7 percentage points. These findings suggest that answer extraction relies not on coherent, sequential reasoning traces but on a sparse, order-insensitive, yet structurally robust informational substrate, which calls into question the need for dense, ordered chains of thought.
📝 Abstract
Modern reasoning language models generate dense, sequential chain-of-thought traces, implicitly assuming that every token contributes and that steps must be consumed in order. We challenge both assumptions through a systematic intervention pipeline (removal, masking, shuffling, and noise injection) applied to model-generated reasoning chains across three models and three benchmarks. Our findings are counterintuitive on three dimensions. Order: Does the sequential order of a reasoning chain matter for answer extraction? No: line-level shuffling reduces accuracy by less than 0.5 pp; word-level shuffling retains 62%–89% accuracy; only token-level shuffling collapses accuracy to near zero. Pretrained-only and instruction-tuned variants exhibit near-identical tolerance (78.67% vs. 78.00% under line shuffling), indicating that order-independence originates from pretraining rather than reasoning-specific fine-tuning. Density: Is all the information in a reasoning chain important for answer extraction? No: masking numeric digits collapses accuracy to exactly 0%, while masking alphabetic prose improves accuracy by 4.7 pp. Robustness: Is a reasoning chain that is both order-shuffled and non-dense still robust? Yes: the most aggressively reduced representation (all natural language removed, lines arbitrarily shuffled) still achieves 83% accuracy, and injecting false answers at 3x the true-answer frequency leaves accuracy unchanged (83.3% -> 83.3%), falsifying a frequency-based extraction account. These results establish that answer extraction operates on a sparse, order-insensitive, and structurally robust informational substrate, opening paths toward parallelized and token-efficient reasoning generation.
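To make the interventions named in the abstract concrete, the following is a minimal sketch of four of them: line-level shuffling, word-level shuffling, digit masking, and alphabetic masking. It is not the authors' pipeline. The function names, the `_` mask character, whitespace-based word splitting, and the toy `chain` string are all assumptions made for illustration. Token-level shuffling is omitted because it depends on each model's tokenizer.

```python
import random
import re


def shuffle_lines(chain: str, seed: int = 0) -> str:
    """Permute the lines of a reasoning chain; each line stays intact."""
    lines = chain.splitlines()
    random.Random(seed).shuffle(lines)
    return "\n".join(lines)


def shuffle_words(chain: str, seed: int = 0) -> str:
    """Permute whitespace-separated words across the whole chain."""
    words = chain.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def mask_digits(chain: str, mask: str = "_") -> str:
    """Replace every numeric digit with a placeholder (assumed mask char)."""
    return re.sub(r"\d", mask, chain)


def mask_letters(chain: str, mask: str = "_") -> str:
    """Replace every ASCII letter, leaving digits and symbols untouched."""
    return re.sub(r"[A-Za-z]", mask, chain)


# Toy chain, invented for illustration only.
chain = "Tom has 3 apples.\nHe buys 4 more.\n3 + 4 = 7"

print(mask_digits(chain))
print(mask_letters(chain))
print(shuffle_lines(chain, seed=1))
```

In an experiment of the kind described, each perturbed chain would be passed back to the model in place of the original before the answer is extracted. That step is model-specific and is left out here.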