🤖 AI Summary
This work addresses a critical failure mode in long chain-of-thought reasoning: once a model commits an early error, it often becomes locked into a self-consistent yet incorrect reasoning trajectory (a "thinking trap") that subsequent reflection struggles to correct. The study formally defines and quantifies this phenomenon for the first time, finding that 89% of failures on a curated DAPO-MATH subset involve such traps, and introduces the TAAR framework, which dynamically diagnoses erroneous reasoning prefixes during inference. By analyzing partial trajectories to predict both the trap location and the escape probability, TAAR enables test-time intervention without fine-tuning, using temperature-based resampling and structured suffix restarts. Evaluated on challenging benchmarks including AIME24 and GPQA-Diamond, the method significantly improves accuracy.
📝 Abstract
Scaling test-time compute via Long Chain-of-Thought (Long-CoT) significantly enhances reasoning capabilities, yet extended generation does not guarantee correctness: after an early wrong commitment, models may keep elaborating a self-consistent but incorrect prefix. Through fine-grained trajectory analysis, we identify Thinking Traps, prefix-dominant deadlocks where later reflection, alternative attempts, or verification fails to revise the root error. On a curated subset of DAPO-MATH, 89% of failures exhibit such traps. To solve this problem, we introduce TAAR (Trap-Aware Adaptive Restart), a test-time control framework that trains a diagnostic policy to predict two signals from partial trajectories: a trap index for where to truncate and an escape probability for whether and how strongly to intervene. At inference time, TAAR truncates the trajectory before the predicted trap segment and adaptively restarts decoding; for severely trapped cases, it applies stronger perturbations, including higher-temperature resampling and an optional structured reboot suffix. Experiments on challenging mathematical and scientific reasoning benchmarks (AIME24, AIME25, GPQA-Diamond, HMMT25, BRUMO25) show that TAAR improves reasoning performance without fine-tuning base model parameters.
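The control loop the abstract describes can be sketched in a few lines. This is a minimal, illustrative Python mock, not the paper's implementation: `diagnose` stands in for the learned diagnostic policy, `generate` is an assumed decoding callable, and the threshold and temperature values are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    trap_index: int     # segment index where the trap is predicted to begin
    escape_prob: float  # predicted probability the model escapes unaided


def diagnose(segments):
    """Stand-in for TAAR's trained diagnostic policy (hypothetical stub).

    The real policy scores partial trajectories; here we simply flag the
    first segment containing a marker string, purely for illustration.
    """
    for i, seg in enumerate(segments):
        if "<!--trap-->" in seg:
            return Diagnosis(trap_index=i, escape_prob=0.2)
    return Diagnosis(trap_index=len(segments), escape_prob=1.0)


def taar_restart(segments, generate, escape_threshold=0.5,
                 base_temp=0.7, boost_temp=1.2,
                 reboot_suffix="\nLet me restart with a different approach.\n"):
    """Trap-aware adaptive restart over a segmented reasoning trajectory.

    `generate(prefix, temperature)` is an assumed decoding function that
    continues generation from the given text prefix.
    """
    d = diagnose(segments)
    if d.escape_prob >= escape_threshold:
        return segments                    # likely to self-correct: no intervention
    prefix = segments[:d.trap_index]       # truncate before the predicted trap
    # Severely trapped cases get a stronger perturbation: higher-temperature
    # resampling plus a structured reboot suffix appended to the prefix.
    temp = boost_temp if d.escape_prob < 0.25 else base_temp
    continuation = generate("".join(prefix) + reboot_suffix, temperature=temp)
    return prefix + [reboot_suffix, continuation]
```

For example, with a trajectory whose second segment is diagnosed as the trap, `taar_restart` keeps only the first segment, appends the reboot suffix, and resamples the rest at the boosted temperature.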