ARM: Adaptive Reasoning Model

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large reasoning models often exhibit “overthinking” (redundant token consumption on simple tasks) caused by fixed inference paths, contradicting the goal of autonomous AI. This work proposes an adaptive inference format switching mechanism that dynamically selects the optimal reasoning format (Direct Answer, Short Chain-of-Thought, Code, or Long Chain-of-Thought) based on task difficulty. Our contributions are threefold: (1) Ada-GRPO, a novel reinforcement learning algorithm extending Group Relative Policy Optimization, which effectively mitigates format collapse in multi-format training; (2) two additional reasoning modes, Instruction-Guided and Consensus-Guided, enabling flexible control over format selection; and (3) a multi-format reasoning architecture with dynamic decision gating. Experiments demonstrate an average 30% reduction in inference tokens (up to 70%), a 2× speedup in training, and performance on par with a pure Long CoT baseline.
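The summary credits Ada-GRPO with mitigating format collapse, i.e., preventing training from converging on a single reasoning format. The paper's exact objective is not reproduced here; as a rough illustration only, the sketch below assumes a frequency-based reward rescaling within each GRPO rollout group (an assumption for exposition, not the paper's stated formula), so rollouts in rare formats keep gradient signal:

```python
import math
from collections import Counter

def ada_grpo_advantages(rewards, formats):
    """Illustrative group-relative advantages with format-rarity scaling.

    rewards[i]: scalar reward of rollout i in the group.
    formats[i]: the reasoning format rollout i used (e.g. "short_cot").
    """
    counts = Counter(formats)
    group_size = len(rewards)
    # Up-weight rewards of formats that appear rarely in this group,
    # so a dominant format cannot starve the others of gradient signal.
    scaled = [r * group_size / counts[f] for r, f in zip(rewards, formats)]
    # Standard GRPO step: whiten rewards within the group.
    mean = sum(scaled) / group_size
    std = math.sqrt(sum((s - mean) ** 2 for s in scaled) / group_size) or 1.0
    return [(s - mean) / std for s in scaled]
```

With uniform rewards, a format sampled once in a group of four receives a positive advantage while the dominant format's rollouts go slightly negative, which is the collapse-countering pressure the summary describes.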

📝 Abstract
While large reasoning models demonstrate strong performance on complex tasks, they lack the ability to adjust reasoning token usage based on task difficulty. This often leads to the "overthinking" problem -- excessive and unnecessary reasoning -- which, although potentially mitigated by human intervention to control the token budget, still fundamentally contradicts the goal of achieving fully autonomous AI. In this work, we propose Adaptive Reasoning Model (ARM), a reasoning model capable of adaptively selecting appropriate reasoning formats based on the task at hand. These formats include three efficient ones -- Direct Answer, Short CoT, and Code -- as well as a more elaborate format, Long CoT. To train ARM, we introduce Ada-GRPO, an adaptation of Group Relative Policy Optimization (GRPO), which addresses the format collapse issue in traditional GRPO. Ada-GRPO enables ARM to achieve high token efficiency, reducing tokens by an average of 30%, and up to 70%, while maintaining performance comparable to the model that relies solely on Long CoT. Furthermore, not only does it improve inference efficiency through reduced token generation, but it also brings a 2x speedup in training. In addition to the default Adaptive Mode, ARM supports two additional reasoning modes: 1) Instruction-Guided Mode, which allows users to explicitly specify the reasoning format via special tokens -- ideal when the appropriate format is known for a batch of tasks. 2) Consensus-Guided Mode, which aggregates the outputs of the three efficient formats and resorts to Long CoT in case of disagreement, prioritizing performance with higher token usage.
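The Consensus-Guided Mode described in the abstract has a simple control flow: answer with the three efficient formats, accept a unanimous answer, and escalate to Long CoT otherwise. A minimal sketch, where `generate` is a hypothetical stand-in for the model's decoding call and the format names are illustrative labels:

```python
from collections import Counter

EFFICIENT_FORMATS = ["direct_answer", "short_cot", "code"]

def consensus_guided(question, generate):
    """Return an answer, spending Long CoT tokens only on disagreement."""
    answers = [generate(question, fmt) for fmt in EFFICIENT_FORMATS]
    answer, votes = Counter(answers).most_common(1)[0]
    if votes == len(EFFICIENT_FORMATS):  # unanimous across efficient formats
        return answer
    # The efficient formats disagree: fall back to the elaborate format.
    return generate(question, "long_cot")
```

This trades tokens for reliability only when the cheap formats conflict, matching the abstract's "prioritizing performance with higher token usage."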
Problem

Research questions and friction points this paper is trying to address.

Reasoning models cannot adjust token usage to match task difficulty
Overthinking wastes tokens and contradicts fully autonomous AI
Achieving token efficiency and faster training without sacrificing performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive reasoning format selection based on task
Ada-GRPO training method prevents format collapse
Three reasoning modes for efficiency and flexibility
Siye Wu, Fudan University
Jian Xie, Fudan University
Yikai Zhang, Fudan University (Natural Language Processing, Autonomous Agent)
Aili Chen, Fudan University (Large Language Model, Reasoning and Planning, Language Agent, LLM Personalization)
Kai Zhang, The Ohio State University
Yu Su, The Ohio State University
Yanghua Xiao, Fudan University