🤖 AI Summary
This work addresses the high computational cost of Chain-of-Thought (CoT) reasoning, which often stems from verbose and repetitive generation trajectories. The authors propose a training-free, plug-and-play decoding method that dynamically detects reasoning saturation by leveraging the model's attention patterns toward a special token, "/think", thereby automatically truncating redundant outputs without any fine-tuning. Compatible with mainstream large language model architectures, the approach achieves an average Top-1 accuracy of 62.00% across multiple benchmarks while using only 656 tokens and incurring a latency of 28.68 seconds, cutting both generation length and inference time by roughly 69% compared to full CoT. Notably, it also yields up to an 8.1-point absolute accuracy improvement on challenging tasks such as GPQA.
📝 Abstract
Chain-of-Thought (CoT) prompting improves reasoning but often produces long and redundant traces that substantially increase inference cost. We present SyncThink, a training-free and plug-and-play decoding method that reduces CoT overhead without modifying model weights. We find that answer tokens attend weakly to early reasoning and instead focus on the special token "/think", indicating an information bottleneck. Building on this observation, SyncThink monitors the model's own reasoning-transition signal and terminates reasoning early once that signal appears. Experiments on GSM8K, MMLU, GPQA, and BBH across three DeepSeek-R1 distilled models show that SyncThink achieves 62.00 percent average Top-1 accuracy using 656 generated tokens and 28.68 s latency, compared to 61.22 percent, 2141 tokens, and 92.01 s for full CoT decoding. On long-horizon tasks such as GPQA, SyncThink can further yield up to +8.1 points absolute accuracy by preventing over-thinking.
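The abstract's core idea, truncating CoT decoding when attention on the "/think" token signals reasoning saturation, can be illustrated with a minimal sketch. The function name, threshold, and patience values below are all hypothetical placeholders for illustration; the paper does not disclose its exact stopping rule:

```python
def should_stop(attn_to_think, threshold=0.5, patience=3):
    """Return the decoding step at which to truncate reasoning, or None.

    attn_to_think: per-step attention mass placed on the "/think" token
    (a sequence of floats, one per generated token). As a stand-in for
    the paper's reasoning-transition signal, we stop once attention
    stays above `threshold` for `patience` consecutive steps.
    Both hyperparameters are illustrative assumptions.
    """
    run = 0
    for step, attn in enumerate(attn_to_think):
        run = run + 1 if attn >= threshold else 0
        if run >= patience:
            return step
    return None  # signal never emerged: decode the full trace


# Toy trace: attention on "/think" grows as reasoning saturates.
trace = [0.05, 0.10, 0.20, 0.55, 0.60, 0.70, 0.72]
print(should_stop(trace))  # -> 5 (third consecutive step >= 0.5)
```

In a real decoding loop this check would run per generated token, using attention weights extracted from the model's final layers, and truncation would force the transition to answer generation.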