🤖 AI Summary
Large reasoning models (LRMs) suffer from "overthinking": they generate unnecessarily long reasoning chains even on simple tasks, which degrades inference efficiency. Method: We propose the Adaptive Cognition Policy Optimization (ACPO) framework, inspired by dual-process theory in cognitive science. ACPO explicitly models two reasoning modes, "fast thinking" (intuition) and "slow thinking" (logic), via system-aware reasoning tokens, and introduces a dynamic system-switching mechanism driven jointly by online task-difficulty estimation and token-budget allocation. Reasoning depth and breadth are optimized together through supervised fine-tuning followed by reinforcement learning. Contribution/Results: ACPO achieves Pareto improvements in both accuracy and efficiency across diverse complex reasoning tasks, including mathematical and symbolic reasoning, reducing redundant reasoning tokens by 38% on average while maintaining or improving answer accuracy.
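The switching mechanism described above can be sketched as a simple decision rule. This is an illustrative reconstruction only: the mode tokens, the `estimate` argument, and the thresholds are hypothetical names chosen for the sketch, not the paper's actual implementation.

```python
# Hypothetical sketch of a dynamic system-switching decision that
# combines an online difficulty estimate with a token budget.
# All identifiers and threshold values here are assumptions.

FAST_TOKEN = "<system1>"  # fast, intuitive reasoning mode
SLOW_TOKEN = "<system2>"  # slow, deliberate reasoning mode

def choose_mode(difficulty: float, tokens_used: int, budget: int,
                threshold: float = 0.5) -> str:
    """Pick a reasoning-mode token from an online difficulty
    estimate (0 = trivial, 1 = hard) and the remaining budget."""
    remaining = budget - tokens_used
    # Easy task, or budget nearly exhausted: fall back to fast thinking.
    if difficulty < threshold or remaining < 0.1 * budget:
        return FAST_TOKEN
    return SLOW_TOKEN
```

A rule like this makes the trade-off explicit: slow thinking is spent only where the estimated difficulty warrants it and the budget still allows it.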
📝 Abstract
Large reasoning models (LRMs) have demonstrated strong performance on complex reasoning tasks, but they often suffer from overthinking, generating redundant content regardless of task difficulty. Inspired by dual-process theory in cognitive science, we propose Adaptive Cognition Policy Optimization (ACPO), a reinforcement learning framework that enables LRMs to reason efficiently through adaptive cognitive allocation and dynamic system switching. ACPO incorporates two key components: (1) system-aware reasoning tokens that explicitly represent the thinking modes, making the model's cognitive process transparent; and (2) online difficulty estimation combined with a token-length budget to guide adaptive system switching and reasoning during reinforcement learning. To this end, we propose a two-stage training strategy. The first stage uses supervised fine-tuning to cold-start the model, enabling it to generate reasoning paths with explicit thinking modes. In the second stage, we apply ACPO to further enhance adaptive system switching for difficulty-aware reasoning. Experimental results demonstrate that ACPO effectively reduces redundant reasoning while adaptively adjusting cognitive allocation to task complexity, achieving efficient hybrid reasoning.
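One plausible way the RL stage could couple accuracy with the token-length budget is a shaped reward that penalizes tokens spent beyond the budget. The abstract does not give the reward form, so the function below, including the linear penalty and the `alpha` coefficient, is an assumption for illustration.

```python
# Illustrative reward shaping for the RL stage: a correctness reward
# minus a linear penalty on tokens generated beyond the budget.
# The penalty form and the alpha value are assumptions, not the
# paper's stated objective.

def reward(correct: bool, num_tokens: int, budget: int,
           alpha: float = 0.001) -> float:
    """Return 1.0 for a correct answer, 0.0 otherwise, minus
    alpha per token of budget overshoot."""
    base = 1.0 if correct else 0.0
    overshoot = max(0, num_tokens - budget)
    return base - alpha * overshoot
```

Under a reward of this shape, a policy is pushed toward short reasoning on easy inputs (where correctness is cheap) while still being able to spend the full budget on hard ones.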