HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs

📅 2025-09-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) often over-rely on lengthy chain-of-thought (CoT) reasoning for complex tasks, incurring high computational overhead and diminishing inference efficiency. Method: This paper proposes HiPO, the first framework enabling LLMs to dynamically and adaptively select reasoning modes—specifically, to decide whether to invoke detailed CoT based on task difficulty. HiPO integrates a hybrid data pipeline (comprising both explicit and implicit reasoning examples) with multi-objective reinforcement learning that jointly optimizes for accuracy and token efficiency, training the reasoning-mode selection policy end-to-end. Results: Evaluated on mathematical and programming benchmarks, HiPO reduces average inference token consumption by 32%–47% while maintaining or improving task accuracy. These results demonstrate HiPO's effectiveness and generalizability in jointly optimizing accuracy and inference efficiency.

📝 Abstract
Large Language Models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to improve accuracy on complex tasks. However, always generating lengthy reasoning traces is inefficient, leading to excessive token usage and higher inference costs. This paper introduces Hybrid Policy Optimization (HiPO), a framework for adaptive reasoning control that enables LLMs to selectively decide when to engage in detailed reasoning (Think-on) and when to respond directly (Think-off). Specifically, HiPO combines a hybrid data pipeline, providing paired Think-on and Think-off responses, with a hybrid reinforcement learning reward system that balances accuracy and efficiency while avoiding over-reliance on detailed reasoning. Experiments across mathematics and coding benchmarks demonstrate that HiPO substantially reduces token length while maintaining or improving accuracy. Finally, we hope HiPO can serve as a principled approach to efficient adaptive reasoning, advancing the deployment of reasoning-oriented LLMs in real-world, resource-sensitive settings.
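To make the accuracy–efficiency trade-off concrete, the kind of hybrid reward described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the length budget, the weighting coefficient `alpha`, and the Think-off bonus are all assumptions for illustration.

```python
# Illustrative sketch of a hybrid reward trading off answer accuracy
# against token cost. All names and coefficients are assumed, not
# taken from the HiPO paper.

def hybrid_reward(is_correct: bool, num_tokens: int, mode: str,
                  len_budget: int = 2048, alpha: float = 0.3) -> float:
    """Reward = accuracy term - weighted length penalty + mode bonus.

    mode: "think_on" (detailed CoT) or "think_off" (direct answer).
    alpha weighs token efficiency against accuracy.
    """
    accuracy = 1.0 if is_correct else 0.0
    # Normalized token cost in [0, 1]; longer traces pay more.
    length_penalty = min(num_tokens / len_budget, 1.0)
    # A small bonus for solving the task without detailed reasoning
    # discourages over-reliance on Think-on (an assumed mechanism).
    mode_bonus = 0.1 if (mode == "think_off" and is_correct) else 0.0
    return accuracy - alpha * length_penalty + mode_bonus
```

Under this shaping, a correct short direct answer scores higher than an equally correct but long CoT trace, which is the pressure that pushes the policy toward Think-off whenever Think-off suffices.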
Problem

Research questions and friction points this paper is trying to address.

Optimizing LLM reasoning efficiency by reducing excessive token usage
Enabling adaptive reasoning control through hybrid policy optimization
Balancing accuracy and efficiency in dynamic reasoning processes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Policy Optimization framework for adaptive reasoning control
Combines hybrid data pipeline with paired Think-on and Think-off responses
Uses hybrid reinforcement learning reward system balancing accuracy and efficiency
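The paired-response idea behind the hybrid data pipeline can be sketched as a simple mode-labeling rule: given a Think-on and a Think-off response to the same prompt, prefer the cheaper mode whenever it does not cost correctness. The data structure and field names below are hypothetical, chosen only to illustrate the selection logic.

```python
# Hypothetical sketch of labeling a preferred reasoning mode from
# paired Think-on / Think-off responses. Field names are illustrative,
# not from the HiPO paper.

from dataclasses import dataclass

@dataclass
class PairedResponse:
    prompt: str
    on_correct: bool   # Think-on (detailed CoT) answer correct?
    on_tokens: int
    off_correct: bool  # Think-off (direct) answer correct?
    off_tokens: int

def preferred_mode(pair: PairedResponse) -> str:
    """Prefer Think-off when it matches Think-on on correctness,
    since it reaches the same outcome at lower token cost."""
    if pair.off_correct or not pair.on_correct:
        return "think_off"  # direct answer suffices (or CoT fails too)
    return "think_on"       # only detailed reasoning solves the task
```

Such labels could supervise the mode-selection policy before reinforcement learning refines it; the exact training recipe in the paper may differ.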
👥 Authors
Ken Deng
Kwaipilot Team, Kuaishou Technology
Zizheng Zhan
Kuaishou Technology
Wen Xiang
Kuaishou Technology
Wenqiang Zhu
Kuaishou Technology
Tianhao Peng
Kuaishou Technology
Xinping Lei
Kuaishou Technology
Weihao Li
Research Fellow, Australian National University
Jingxuan Xu
Kuaishou Technology
Kun Wu
Kuaishou Technology
Yifan Yao
Drexel University
Haoyang Huang
JD Explore Academy (present) | StepFun | Microsoft Research
Huaixi Tang
Kuaishou Technology
Kepeng Lei
Kuaishou Technology
Zhiyi Lai
Kuaishou Technology
Songwei Yu
Kuaishou Technology
Zongxian Feng
Kuaishou Technology
Zuchen Gao
PhD Candidate, The Hong Kong Polytechnic University
Weihao Xie
Kuaishou Technology
Chenchen Zhang
Kuaishou Technology
Yanan Wu
Kuaishou Technology
Yuanxing Zhang
Kuaishou Technology
Lecheng Huang
Kuaishou Technology
Yuqun Zhang
Kuaishou Technology
Jie Liu
Kuaishou Technology
Zhaoxiang Zhang
Institute of Automation, Chinese Academy of Sciences