🤖 AI Summary
In reinforcement learning–based post-training of language models, the "slow thinking" paradigm can undermine reasoning efficiency: models expend redundant computation on simple tasks and shift their reasoning prematurely on complex ones. Method: This paper proposes AdapThink, an adaptive thinking-preference framework that (1) introduces a group-relative reward function, grounded in model confidence and response characteristics, to dynamically regulate reasoning length, and (2) employs an entropy-guided, diversity-aware sampling mechanism to jointly optimize accuracy and reasoning-path diversity. Results: Evaluated across multiple mathematical reasoning benchmarks, AdapThink significantly enhances the adaptive reasoning capability of DeepSeek-distilled models: it reduces average inference steps by 32%, improves solution stability on complex problems by 19.7%, and maintains, or even slightly improves, overall accuracy.
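To make the first mechanism concrete, here is a minimal, hypothetical sketch of a group-relative reward. The paper does not publish this exact formula; the shaping below only illustrates the stated idea that reflection is scored relative to the sampled group rather than against a fixed length budget, with the penalty direction set by model confidence (all function and variable names are assumptions):

```python
def group_relative_reward(confidences, reflection_counts, correct):
    """Hypothetical group-relative reward sketch.

    confidences:       per-response model confidence in [0, 1]
    reflection_counts: number of reflection-related transition words per response
    correct:           per-response correctness flags
    """
    n = len(confidences)
    mean_refl = sum(reflection_counts) / n  # group statistic, not a fixed budget
    rewards = []
    for conf, refl, ok in zip(confidences, reflection_counts, correct):
        r = 1.0 if ok else 0.0  # base reward: correctness
        # Deviation of this response's reflection count from the group mean.
        rel = (refl - mean_refl) / (mean_refl + 1e-8)
        # Confident model reflecting more than the group: redundant computation.
        r -= conf * max(rel, 0.0)
        # Unconfident model reflecting less than the group: premature shift.
        r -= (1.0 - conf) * max(-rel, 0.0)
        rewards.append(r)
    return rewards
```

Because the reference point is the group mean, the same response can be rewarded or penalized depending on what the model currently samples, which is one way to track the model's evolving capability without a static rule.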
📝 Abstract
Reinforcement Learning (RL)-based post-training has significantly advanced the complex reasoning capabilities of language models, fostering sophisticated self-reflection processes. However, this "slow thinking" paradigm presents a critical challenge to reasoning efficiency: models may expend excessive computation on simple questions and shift reasoning prematurely for complex ones. Previous mechanisms typically rely on static length budgets or predefined rules, lacking adaptability to varying question complexities and models' evolving capabilities. To this end, we propose AdapThink, an adaptive post-training framework designed to induce more efficient thinking while maintaining the performance of reasoning language models. Specifically, AdapThink incorporates two key mechanisms: 1) A group-relative reward function that leverages model confidence and response characteristics to dynamically adjust the preference for reflection-related transition words, without resorting to a fixed length preference. 2) A diversity-aware sampling mechanism that balances the training group's solution accuracy with reasoning diversity via an entropy-guided score. Experiments on several mathematical reasoning datasets with DeepSeek-distilled models demonstrate AdapThink's advantages in enabling adaptive reasoning patterns and mitigating inefficiencies.
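The second mechanism, an entropy-guided score that balances a training group's accuracy against its reasoning diversity, can be sketched as follows. This is an illustrative reading of the abstract, not the paper's actual scoring rule: the reasoning-pattern labels, the normalization, and the mixing weight `alpha` are all assumptions.

```python
import math
from collections import Counter

def entropy_guided_score(patterns, correct, alpha=0.5):
    """Hypothetical diversity-aware group score.

    patterns: a discrete reasoning-pattern label per sampled response
              (how responses are bucketed into patterns is assumed here)
    correct:  per-response correctness flags
    alpha:    assumed weight trading accuracy against diversity
    """
    n = len(patterns)
    accuracy = sum(correct) / n
    # Shannon entropy of the empirical reasoning-pattern distribution.
    counts = Counter(patterns)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    max_entropy = math.log2(n) if n > 1 else 1.0
    diversity = entropy / max_entropy  # normalized to [0, 1]
    return alpha * accuracy + (1.0 - alpha) * diversity
```

A group of all-correct but identical reasoning paths scores lower than an all-correct group with varied paths, so sampling by this score would keep reasoning diversity in the training batch rather than collapsing onto one pattern.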