Thinking Traps in Long Chain-of-Thought: A Measurable Study and Trap-Aware Adaptive Restart

📅 2026-01-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical failure mode in long chain-of-thought reasoning: once a model commits an early error, it often becomes trapped in a self-consistent yet incorrect reasoning trajectory—a "thinking trap"—that subsequent reflection struggles to correct. The study formally defines and quantifies this phenomenon for the first time; on a curated subset of DAPO-MATH, 89% of failure cases are confirmed to involve such traps. It then introduces TAAR, a framework that dynamically diagnoses erroneous reasoning prefixes during inference. By using trajectory analysis to predict both the trap location and the probability of escaping it, TAAR enables test-time intervention without fine-tuning, leveraging temperature-based resampling and structured suffix restarts. Evaluated on challenging benchmarks including AIME24 and GPQA-Diamond, the method significantly improves accuracy.

📝 Abstract
Scaling test-time compute via Long Chain-of-Thought (Long-CoT) significantly enhances reasoning capabilities, yet extended generation does not guarantee correctness: after an early wrong commitment, models may keep elaborating a self-consistent but incorrect prefix. Through fine-grained trajectory analysis, we identify Thinking Traps, prefix-dominant deadlocks where later reflection, alternative attempts, or verification fails to revise the root error. On a curated subset of DAPO-MATH, 89% of failures exhibit such traps. To solve this problem, we introduce TAAR (Trap-Aware Adaptive Restart), a test-time control framework that trains a diagnostic policy to predict two signals from partial trajectories: a trap index for where to truncate and an escape probability for whether and how strongly to intervene. At inference time, TAAR truncates the trajectory before the predicted trap segment and adaptively restarts decoding; for severely trapped cases, it applies stronger perturbations, including higher-temperature resampling and an optional structured reboot suffix. Experiments on challenging mathematical and scientific reasoning benchmarks (AIME24, AIME25, GPQA-Diamond, HMMT25, BRUMO25) show that TAAR improves reasoning performance without fine-tuning base model parameters.
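The adaptive-restart control loop described in the abstract can be sketched as a small decision rule. Everything here is an illustrative assumption rather than the paper's implementation: `generate_fn` stands in for the base model's decoder, and the temperature values, escape threshold, and `reboot_suffix` placeholder are hypothetical, not the authors' hyperparameters.

```python
def taar_restart(trajectory_tokens, trap_index, escape_prob, generate_fn,
                 base_temp=0.7, max_temp=1.3, escape_threshold=0.5,
                 reboot_suffix=None):
    """Hypothetical sketch of TAAR's test-time intervention.

    trap_index:  diagnostic policy's predicted start of the trapped segment.
    escape_prob: diagnostic policy's predicted chance that ordinary
                 resampling escapes the trap.
    """
    # Truncate the trajectory before the predicted trap segment.
    prefix = trajectory_tokens[:trap_index]

    if escape_prob < escape_threshold:
        # Severely trapped: apply a stronger perturbation via
        # higher-temperature resampling and an optional structured reboot.
        temperature = max_temp
        if reboot_suffix is not None:
            prefix = prefix + reboot_suffix
    else:
        # Mild case: restart decoding at the ordinary temperature.
        temperature = base_temp

    return generate_fn(prefix, temperature=temperature)
```

The key design point the abstract emphasizes is that both signals come from a learned diagnostic policy over partial trajectories, so the intervention strength adapts per case without touching the base model's parameters.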
Problem

Research questions and friction points this paper is trying to address.

Thinking Traps
Long Chain-of-Thought
reasoning errors
prefix-dominant deadlocks
test-time reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Thinking Traps
Long Chain-of-Thought
Adaptive Restart
Test-time Compute
Reasoning Robustness
Kang Chen
Fudan University
Fan Yu
Fudan University
Junjie Nian
Fudan University
Shihan Zhao
Fudan University
Zhuoka Feng
Fudan University
Zijun Yao
Heng Wang
Fudan University
Minshen Yu
Fudan University
Yixin Cao
Fudan University
Natural Language Processing · Knowledge Engineering · Multi-modal Data Processing