🤖 AI Summary
Problem: Large language models (LLMs) often over-rely on lengthy chain-of-thought (CoT) reasoning for complex tasks, which incurs high computational overhead and reduces inference efficiency.
Method: This paper proposes HiPO, the first framework enabling LLMs to select reasoning modes dynamically and adaptively, that is, to decide whether to invoke detailed CoT based on task difficulty. HiPO integrates a hybrid data pipeline (comprising both explicit and implicit reasoning examples) with multi-objective reinforcement learning that jointly optimizes accuracy and token efficiency, training a reasoning-mode selection policy end to end.
Results: Evaluated on mathematical and programming benchmarks, HiPO reduces average inference token consumption by 32%–47% while maintaining or improving task accuracy. These results demonstrate HiPO’s effectiveness and generalizability in achieving joint optimization of accuracy and inference efficiency.
📝 Abstract
Large Language Models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to improve accuracy on complex tasks. However, always generating lengthy reasoning traces is inefficient, leading to excessive token usage and higher inference costs. This paper introduces Hybrid Policy Optimization (HiPO), a framework for adaptive reasoning control that enables LLMs to selectively decide when to engage in detailed reasoning (Think-on) and when to respond directly (Think-off). Specifically, HiPO combines a hybrid data pipeline, which provides paired Think-on and Think-off responses, with a hybrid reinforcement learning reward system that balances accuracy and efficiency while avoiding over-reliance on detailed reasoning. Experiments across mathematics and coding benchmarks demonstrate that HiPO can substantially reduce token length while maintaining or improving accuracy. Finally, we hope HiPO can be a principled approach for efficient adaptive reasoning, advancing the deployment of reasoning-oriented LLMs in real-world, resource-sensitive settings.
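To make the idea of a reward that balances accuracy and efficiency concrete, the sketch below scores a paired Think-on and Think-off response with a correctness term minus a token-length penalty. This is an illustration only, not the reward defined in the paper: the `Response` type, the `hybrid_reward` and `preferred_mode` functions, and the `max_tokens` and `length_weight` parameters are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Response:
    mode: str        # "think_on" (detailed reasoning) or "think_off" (direct answer)
    correct: bool    # whether the final answer was judged correct
    num_tokens: int  # number of generated tokens


def hybrid_reward(resp: Response, max_tokens: int = 4096,
                  length_weight: float = 0.2) -> float:
    """Hypothetical reward: 1 for a correct answer, minus a length penalty.

    Assumption for illustration; the paper's actual reward may differ.
    """
    accuracy_term = 1.0 if resp.correct else 0.0
    # Fraction of the token budget used, capped at 1.0.
    length_fraction = min(resp.num_tokens, max_tokens) / max_tokens
    return accuracy_term - length_weight * length_fraction


def preferred_mode(think_on: Response, think_off: Response) -> str:
    """Return the mode of the higher-reward response (ties go to think_off)."""
    if hybrid_reward(think_on) > hybrid_reward(think_off):
        return think_on.mode
    return think_off.mode


# Easy question: both modes are correct, so the shorter answer wins.
easy = preferred_mode(Response("think_on", True, 2048),
                      Response("think_off", True, 128))

# Hard question: only the detailed reasoning is correct, so it wins
# despite using more tokens.
hard = preferred_mode(Response("think_on", True, 2048),
                      Response("think_off", False, 128))
```

Under these assumed weights, a correct answer always outscores an incorrect one because the length penalty never exceeds `length_weight`, so token savings only matter when accuracy is equal. A real training setup would tune that trade-off rather than fix it.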