🤖 AI Summary
This study systematically evaluates the impact of truncated chain-of-thought (CoT) reasoning under severe token budget constraints (10%–70%) on mathematical reasoning tasks such as AIME and GSM8K. Investigating multiple reasoning modalities—including code, natural language commentary, hybrid approaches, and no reasoning—across state-of-the-art models (e.g., GPT-5.1, Gemini 3 Flash, Grok, and DeepSeek-V3.2), the work reveals that code-based reasoning is significantly more robust to truncation than natural language. Contrary to common assumptions, hybrid reasoning does not consistently yield superior performance. Notably, Grok maintains 80–90% accuracy even at 30% token budget, whereas DeepSeek-V3.2 suffers a sharp decline to 17% accuracy at 50% budget, highlighting substantial inter-model variation in sensitivity to incomplete reasoning traces.
📝 Abstract
Reasoning-specialized models like OpenAI's GPT-5.1 and DeepSeek-V3.2 allocate substantial inference compute to extended chain-of-thought (CoT) traces, yet reasoning tokens incur significant cost. How do different reasoning modalities (code, natural language, hybrid, or none) perform under token constraints? We introduce a framework that constrains models to reason exclusively through code, comments, both, or neither, then systematically ablates token budgets to 10\%, 30\%, 50\%, and 70\% of optimal. We evaluate four frontier models (GPT-5.1, Gemini 3 Flash, DeepSeek-V3.2, Grok 4.1) across mathematical benchmarks (AIME, GSM8K, HMMT). Our findings reveal: (1) \textbf{truncated reasoning can hurt}: DeepSeek-V3.2 achieves 53\% with no reasoning but only 17\% with truncated CoT at a 50\% budget; (2) \textbf{code degrades gracefully}: Gemini's comment-only reasoning collapses to 0\% while its code-only reasoning maintains 43-47\%; (3) \textbf{hybrid reasoning underperforms} single modalities; (4) \textbf{robustness is model-dependent}: Grok maintains 80-90\% at a 30\% budget where GPT-5.1 and DeepSeek-V3.2 collapse to 7-27\%. These results suggest incomplete reasoning chains actively mislead models, with implications for deploying reasoning-specialized systems under resource constraints.
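The ablation design in the abstract can be sketched as a small harness: a modality prompt crossed with a token-budget cap on the reasoning trace. This is a minimal illustrative sketch, not the authors' actual framework; the prompt strings, `truncate_tokens`, and `build_ablation_grid` are assumed names for illustration.

```python
# Hedged sketch of the modality x budget ablation described in the abstract.
# All identifiers and prompt wordings below are assumptions, not the paper's code.

MODALITIES = {
    "code": "Reason only in executable code; no prose.",
    "comments": "Reason only in natural-language comments.",
    "hybrid": "Reason in code with interleaved natural-language comments.",
    "none": "Answer directly with no intermediate reasoning.",
}

# Budgets as fractions of each problem's unconstrained ("optimal") trace length.
BUDGETS = [0.10, 0.30, 0.50, 0.70]

def truncate_tokens(tokens, fraction):
    """Keep only the leading `fraction` of a reasoning trace's tokens,
    simulating a hard token-budget cutoff mid-chain."""
    keep = max(1, int(len(tokens) * fraction))
    return tokens[:keep]

def build_ablation_grid(baseline_lengths):
    """Cross every problem with every modality and budget.

    `baseline_lengths` maps problem id -> token count of the model's
    unconstrained reasoning trace for that problem."""
    grid = []
    for pid, optimal_len in baseline_lengths.items():
        for modality in MODALITIES:
            for b in BUDGETS:
                grid.append((pid, modality, int(optimal_len * b)))
    return grid
```

The key design point this makes concrete: truncation keeps a *prefix* of the trace rather than shortening a complete one, which is why a partial chain can mislead the model more than no chain at all.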