🤖 AI Summary
This work addresses the high computational cost of Chain-of-Thought (CoT) reasoning, which often stems from verbose and repetitive generation trajectories. The authors propose a training-free, plug-and-play decoding method that dynamically detects reasoning saturation by leveraging the model's attention patterns toward a special token, "/think", thereby automatically truncating redundant outputs without any fine-tuning. Compatible with mainstream large language model architectures, the approach achieves an average Top-1 accuracy of 62.00% across multiple benchmarks while using only 656 tokens and incurring a latency of 28.68 seconds, cutting both generation length and inference time by roughly 69% compared to full CoT. Notably, it also yields up to an 8.1-point absolute accuracy improvement on challenging tasks such as GPQA.
📝 Abstract
Chain-of-Thought (CoT) prompting improves reasoning but often produces long and redundant traces that substantially increase inference cost. We present SyncThink, a training-free and plug-and-play decoding method that reduces CoT overhead without modifying model weights. We find that answer tokens attend weakly to early reasoning and instead focus on the special token "/think", indicating an information bottleneck. Building on this observation, SyncThink monitors the model's own reasoning-transition signal and terminates reasoning early once that signal appears. Experiments on GSM8K, MMLU, GPQA, and BBH across three DeepSeek-R1 distilled models show that SyncThink achieves 62.00 percent average Top-1 accuracy using 656 generated tokens and 28.68 s latency, compared to 61.22 percent, 2141 tokens, and 92.01 s for full CoT decoding. On long-horizon tasks such as GPQA, SyncThink can further yield up to +8.1 points absolute accuracy by preventing over-thinking.
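The abstract's core idea, truncating CoT decoding when attention on the "/think" token signals reasoning saturation, can be illustrated with a minimal sketch. The function name, threshold, and patience values below are all hypothetical placeholders for illustration; the paper does not disclose its exact stopping rule:

```python
def should_stop(attn_to_think, threshold=0.5, patience=3):
    """Return the decoding step at which to truncate reasoning, or None.

    attn_to_think: per-step attention mass placed on the "/think" token
    (a sequence of floats, one per generated token). As a stand-in for
    the paper's reasoning-transition signal, we stop once attention
    stays above `threshold` for `patience` consecutive steps.
    Both hyperparameters are illustrative assumptions.
    """
    run = 0
    for step, attn in enumerate(attn_to_think):
        run = run + 1 if attn >= threshold else 0
        if run >= patience:
            return step
    return None  # signal never emerged: decode the full trace


# Toy trace: attention on "/think" grows as reasoning saturates.
trace = [0.05, 0.10, 0.20, 0.55, 0.60, 0.70, 0.72]
print(should_stop(trace))  # -> 5 (third consecutive step >= 0.5)
```

In a real decoding loop this check would run per generated token, using attention weights extracted from the model's final layers, and truncation would force the transition to answer generation.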