HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs

📅 2025-09-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Large language models (LLMs) often over-rely on lengthy chain-of-thought (CoT) reasoning for complex tasks, which incurs high computational overhead and reduces inference efficiency. Method: This paper proposes HiPO, the first framework that enables LLMs to select reasoning modes dynamically, deciding whether to invoke detailed CoT based on task difficulty. HiPO integrates a hybrid data pipeline (comprising both explicit and implicit reasoning examples) with multi-objective reinforcement learning that jointly optimizes accuracy and token efficiency, training a reasoning-mode selection policy end to end. Results: Evaluated on mathematical and programming benchmarks, HiPO reduces average inference token consumption by 32%–47% while maintaining or improving task accuracy. These results demonstrate HiPO's effectiveness and generalizability in jointly optimizing accuracy and inference efficiency.

📝 Abstract

Large Language Models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to improve accuracy on complex tasks. However, always generating lengthy reasoning traces is inefficient, leading to excessive token usage and higher inference costs. This paper introduces Hybrid Policy Optimization (HiPO), a framework for adaptive reasoning control that enables LLMs to selectively decide when to engage in detailed reasoning (Think-on) and when to respond directly (Think-off). Specifically, HiPO combines a hybrid data pipeline, which provides paired Think-on and Think-off responses, with a hybrid reinforcement learning reward system that balances accuracy and efficiency while avoiding over-reliance on detailed reasoning. Experiments across mathematics and coding benchmarks demonstrate that HiPO can substantially reduce token length while maintaining or improving accuracy. Finally, we hope HiPO can be a principled approach for efficient adaptive reasoning, advancing the deployment of reasoning-oriented LLMs in real-world, resource-sensitive settings.
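To make the idea of paired Think-on and Think-off responses concrete, here is a minimal Python sketch of how a preferred mode might be chosen for one query from sampled responses in both modes. The abstract does not specify the selection rule, so everything here is an assumption for illustration: the `Sample` type, the `preferred_mode` function, and the `margin` threshold are hypothetical names, not part of HiPO.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    mode: str        # "think_on" or "think_off"
    correct: bool    # whether the final answer was judged correct
    num_tokens: int  # length of the generated response


def accuracy(samples: list[Sample]) -> float:
    """Fraction of correct samples; 0.0 for an empty group."""
    if not samples:
        return 0.0
    return sum(s.correct for s in samples) / len(samples)


def preferred_mode(samples: list[Sample], margin: float = 0.1) -> str:
    """Hypothetical rule: answer directly unless detailed reasoning
    is more accurate by at least `margin` on this query."""
    think_on = [s for s in samples if s.mode == "think_on"]
    think_off = [s for s in samples if s.mode == "think_off"]
    if accuracy(think_on) - accuracy(think_off) > margin:
        return "think_on"
    return "think_off"


# An easy query: both modes succeed, so the shorter direct mode is preferred.
easy = [
    Sample("think_on", True, 800), Sample("think_on", True, 750),
    Sample("think_off", True, 40), Sample("think_off", True, 35),
]
# A hard query: only detailed reasoning succeeds.
hard = [
    Sample("think_on", True, 1200), Sample("think_on", True, 1100),
    Sample("think_off", False, 50), Sample("think_off", False, 45),
]
```

Under this assumed rule, the easy query is labeled for direct answering and the hard query for detailed reasoning, which mirrors the adaptive behavior the abstract describes without claiming to reproduce the paper's pipeline.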

Problem

Research questions and friction points this paper is trying to address.

Optimizing LLM reasoning efficiency by reducing excessive token usage

Enabling adaptive reasoning control through hybrid policy optimization

Balancing accuracy and efficiency in dynamic reasoning processes

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Policy Optimization framework for adaptive reasoning control

Hybrid data pipeline that provides paired Think-on and Think-off responses

Hybrid reinforcement learning reward system that balances accuracy and efficiency
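As a rough illustration of a reward that balances accuracy and efficiency, the sketch below combines a correctness term with a length penalty and a small cost for engaging detailed reasoning. The source does not give HiPO's reward formula, so this is not the paper's reward: the function name `hybrid_reward` and the parameters `token_budget`, `length_weight`, and `think_on_cost` are hypothetical and their values are arbitrary.

```python
def hybrid_reward(
    correct: bool,
    num_tokens: int,
    mode: str,
    token_budget: int = 1024,     # assumed normalization constant
    length_weight: float = 0.2,   # assumed weight on the length penalty
    think_on_cost: float = 0.05,  # assumed fixed cost of detailed reasoning
) -> float:
    """Toy reward: correctness minus a capped length penalty and a mode cost."""
    accuracy_term = 1.0 if correct else 0.0
    # The penalty grows with response length and is capped at length_weight.
    length_penalty = length_weight * min(num_tokens / token_budget, 1.0)
    # Discourage over-reliance on detailed reasoning with a small fixed cost.
    mode_penalty = think_on_cost if mode == "think_on" else 0.0
    return accuracy_term - length_penalty - mode_penalty


short_direct = hybrid_reward(True, 100, "think_off")
long_reasoned = hybrid_reward(True, 900, "think_on")
wrong_direct = hybrid_reward(False, 100, "think_off")
```

With these assumed weights, a correct short answer scores higher than a correct long one, while any correct answer still outscores an incorrect one, so efficiency never outweighs accuracy.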

🔎 Similar Papers

DOTS: Learning to Reason Dynamically in LLMs via Optimal Reasoning Trajectories Search