Incentivizing Dual Process Thinking for Efficient Large Language Model Reasoning

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large reasoning models (LRMs) suffer from “overthinking”—generating unnecessarily long reasoning chains even on simple tasks, degrading inference efficiency. Method: We propose the Adaptive Cognitive Policy Optimization (ACPO) framework, inspired by dual-process theory in cognitive science. ACPO explicitly models two reasoning modes—“fast thinking” (intuition) and “slow thinking” (logic)—via system-aware reasoning tokens, and introduces a dynamic system-switching mechanism jointly driven by online task-difficulty estimation and token-budget allocation. It jointly optimizes reasoning depth and breadth via reinforcement learning and supervised fine-tuning. Contribution/Results: ACPO achieves Pareto improvements in both accuracy and efficiency across diverse complex reasoning tasks—including mathematical and symbolic reasoning—reducing redundant reasoning tokens by 38% on average while maintaining or improving answer accuracy.
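The dynamic system-switching mechanism described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the mode-token strings, the helper signature, and the thresholds are all assumptions layered on the summary's description of a switch driven by online difficulty estimation and token-budget allocation.

```python
# Hypothetical sketch of ACPO-style dynamic system switching: pick a
# "fast" (intuitive) or "slow" (deliberate) reasoning mode from an online
# difficulty estimate and the remaining token budget.
# All names and thresholds here are illustrative assumptions.

FAST_TOKEN = "<system1>"   # assumed marker for intuitive, short reasoning
SLOW_TOKEN = "<system2>"   # assumed marker for deliberate, long reasoning


def choose_mode(difficulty: float, tokens_used: int, token_budget: int,
                difficulty_threshold: float = 0.5) -> str:
    """Pick a reasoning-mode token from task difficulty and budget left.

    difficulty   -- online estimate in [0, 1]; higher means harder.
    tokens_used  -- reasoning tokens emitted so far.
    token_budget -- total reasoning tokens allocated to this task.
    """
    budget_left = max(token_budget - tokens_used, 0)
    # Easy tasks, or tasks near budget exhaustion, fall back to fast thinking.
    if difficulty < difficulty_threshold or budget_left < token_budget * 0.1:
        return FAST_TOKEN
    return SLOW_TOKEN


# Simple demonstration of the decision rule.
print(choose_mode(0.2, tokens_used=10, token_budget=512))   # easy -> fast
print(choose_mode(0.9, tokens_used=10, token_budget=512))   # hard -> slow
print(choose_mode(0.9, tokens_used=500, token_budget=512))  # budget low -> fast
```

The point of the sketch is only the decision structure: easy tasks and near-exhausted budgets route to the cheap mode, so extra reasoning tokens are spent only where the difficulty estimate says they help.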

📝 Abstract
Large reasoning models (LRMs) have demonstrated strong performance on complex reasoning tasks, but often suffer from overthinking, generating redundant content regardless of task difficulty. Inspired by the dual process theory in cognitive science, we propose Adaptive Cognition Policy Optimization (ACPO), a reinforcement learning framework that enables LRMs to achieve efficient reasoning through adaptive cognitive allocation and dynamic system switch. ACPO incorporates two key components: (1) system-aware reasoning tokens that explicitly represent the thinking modes, making the model's cognitive process transparent, and (2) online difficulty estimation and a token length budget that jointly guide adaptive system switch and reasoning during reinforcement learning. To this end, we propose a two-stage training strategy. The first stage begins with supervised fine-tuning to cold-start the model, enabling it to generate reasoning paths with explicit thinking modes. In the second stage, we apply ACPO to further enhance adaptive system switch for difficulty-aware reasoning. Experimental results demonstrate that ACPO effectively reduces redundant reasoning while adaptively adjusting cognitive allocation based on task complexity, achieving efficient hybrid reasoning.
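The second training stage optimizes for accuracy while discouraging redundant reasoning tokens. A minimal reward sketch in that spirit is shown below; the functional form, the coefficient, and the linear overshoot penalty are assumptions for illustration, not the paper's actual objective.

```python
# Hypothetical sketch of a difficulty-aware RL reward combining answer
# correctness with a token-length penalty, in the spirit of ACPO's second
# training stage. The functional form and coefficient are assumptions.

def reward(correct: bool, num_tokens: int, token_budget: int,
           length_penalty: float = 0.5) -> float:
    """Reward = accuracy term minus a penalty for exceeding the budget."""
    accuracy_term = 1.0 if correct else 0.0
    # Penalize only tokens spent beyond the allocated budget, normalized
    # by the budget so the penalty is comparable across tasks.
    overshoot = max(num_tokens - token_budget, 0) / token_budget
    return accuracy_term - length_penalty * overshoot


print(reward(True, num_tokens=300, token_budget=512))   # within budget -> 1.0
print(reward(True, num_tokens=1024, token_budget=512))  # overshoot penalized
print(reward(False, num_tokens=300, token_budget=512))  # wrong answer -> 0.0
```

Under a reward of this shape, a correct short answer dominates a correct long one, which is the pressure that drives the reported reduction in redundant reasoning tokens without sacrificing accuracy.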
Problem

Research questions and friction points this paper is trying to address.

Reducing redundant reasoning in large language models
Adaptive cognitive allocation for task complexity
Dynamic system switch for efficient hybrid reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning framework ACPO for efficient reasoning
System-aware tokens represent transparent cognitive processes
Two-stage training with supervised fine-tuning and ACPO
Xiaoxue Cheng
Renmin University of China
Junyi Li
Department of Computer Science, National University of Singapore
Zhenduo Zhang
Ant Group
Xinyu Tang
Gaoling School of Artificial Intelligence, Renmin University of China
Wayne Xin Zhao
Professor, Renmin University of China
Recommender System, Natural Language Processing, Large Language Model
Xinyu Kong
Ant Group
Zhiqiang Zhang
Ant Group