🤖 AI Summary
This work addresses the uncontrolled computational overhead of chain-of-thought reasoning in large language models (LLMs). We propose ThinkDial, the first open-source end-to-end controllable reasoning framework, which enables dynamic switching among discrete high-, medium-, and low-effort reasoning modes. Methodologically, it integrates budget-mode supervised fine-tuning with a two-phase, budget-aware reinforcement learning (RL) pipeline that uses adaptive reward shaping to internalize controllability throughout the reasoning process. Experiments show that, relative to baseline full-effort inference: (i) Medium mode reduces token consumption by 50% with <10% performance degradation; (ii) Low mode achieves a 75% token reduction with <15% degradation; and (iii) the framework generalizes well to out-of-distribution tasks. The core contribution is the first open implementation of a gpt-oss-style tunable reasoning mechanism, enabling explicit trade-offs between inference cost and accuracy.
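The discrete-mode interface described above can be sketched as follows. This is a minimal illustration, not the paper's actual API: the mode names' budget fractions follow the reported 50%/75% reductions, but the control-tag format and function names are assumptions.

```python
# Hypothetical sketch of discrete reasoning-effort control.
# The <effort:...> tag format is an assumption, not ThinkDial's interface.

BUDGET_FRACTION = {"high": 1.0, "medium": 0.5, "low": 0.25}

def build_prompt(question: str, mode: str = "high") -> str:
    """Prefix the query with a discrete reasoning-effort tag."""
    if mode not in BUDGET_FRACTION:
        raise ValueError(f"unknown mode: {mode}")
    return f"<effort:{mode}>\n{question}"

def expected_budget(full_budget_tokens: int, mode: str) -> int:
    """Target thinking-token budget implied by the selected mode."""
    return int(full_budget_tokens * BUDGET_FRACTION[mode])
```

For example, with a 4096-token full-effort budget, `expected_budget(4096, "medium")` targets 2048 thinking tokens, matching the reported 50% reduction.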
📝 Abstract
Large language models (LLMs) with chain-of-thought reasoning have demonstrated remarkable problem-solving capabilities, but controlling their computational effort remains a significant challenge for practical deployment. Recent proprietary systems like OpenAI's gpt-oss series have introduced discrete operational modes for intuitive reasoning control, but the open-source community has struggled to achieve such capabilities. In this paper, we introduce ThinkDial, the first open-recipe end-to-end framework that successfully implements gpt-oss-style controllable reasoning through discrete operational modes. Our system enables seamless switching among three distinct reasoning regimes: High mode (full reasoning capability), Medium mode (50% token reduction with <10% performance degradation), and Low mode (75% token reduction with <15% performance degradation). We achieve this through an end-to-end training paradigm that integrates budget-mode control throughout the entire pipeline: budget-mode supervised fine-tuning that embeds controllable reasoning capabilities directly into the learning process, and two-phase budget-aware reinforcement learning with adaptive reward shaping. Extensive experiments demonstrate that ThinkDial achieves the target compression-performance trade-offs, with clear response-length reductions while maintaining performance thresholds. The framework also exhibits strong generalization on out-of-distribution tasks.
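One way to picture the budget-aware RL objective is a correctness reward with a soft penalty for overshooting the mode's token budget. The sketch below is an illustrative assumption, not the paper's reward function: the penalty coefficient, the linear overshoot term, and the cap are all placeholders for whatever adaptive shaping the two-phase pipeline actually uses.

```python
# Hedged sketch of budget-aware reward shaping: full reward for a correct
# answer within budget, with reward scaled down as the response exceeds
# the mode's token budget. Coefficients are illustrative, not the paper's.

def shaped_reward(correct: bool, used_tokens: int, budget_tokens: int,
                  alpha: float = 0.5) -> float:
    """Correctness reward minus a capped penalty for exceeding the budget."""
    base = 1.0 if correct else 0.0
    # Fractional overshoot: 0.0 when within budget, 0.5 at 150% of budget, etc.
    overshoot = max(0.0, used_tokens / budget_tokens - 1.0)
    return base - alpha * min(overshoot, 1.0)  # penalty capped at alpha
```

Under this shaping, a correct answer within budget scores 1.0, while a correct answer at 150% of budget scores 0.75, nudging the policy toward shorter reasoning traces without flipping the sign of correctness.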