ORBIT: On-policy Exploration-Exploitation for Controllable Multi-Budget Reasoning

📅 2026-01-13
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing large reasoning models, which typically employ fixed-length chains of thought, leading to computational redundancy and inflexible deployment that hinders dynamic trade-offs between inference cost and accuracy. To overcome this, we propose a controllable multi-budget reasoning framework that activates distinct reasoning modes based on input-specific triggers. Leveraging a multi-stage reinforcement learning approach, the framework identifies Pareto-optimal strategies under varying computational budgets, which are subsequently unified into a single model via on-policy distillation. Our method achieves, for the first time, a flexible decoupling of inference cost and performance, enabling controllable multi-mode reasoning. It maintains clear separation among reasoning modes while efficiently integrating diverse state-of-the-art strategies without compromising overall performance.
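As a rough illustration of the input-triggered mode selection described above, the sketch below prepends a budget-control token to the prompt so a single unified model can be steered toward a low-, medium-, or high-effort reasoning mode. The trigger tokens, token budgets, and base model name are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' code): steering a unified multi-budget
# reasoner with a hypothetical control token per reasoning mode.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODE_TRIGGERS = {            # hypothetical control tokens, one per budget
    "low":    "<budget:low>",
    "medium": "<budget:medium>",
    "high":   "<budget:high>",
}
MODE_MAX_TOKENS = {"low": 512, "medium": 2048, "high": 8192}  # assumed budgets


def answer(question: str, mode: str, model, tokenizer) -> str:
    """Prepend the mode trigger so the model reasons under the chosen budget."""
    prompt = f"{MODE_TRIGGERS[mode]}\n{question}"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=MODE_MAX_TOKENS[mode])
    # Return only the newly generated continuation.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)


if __name__ == "__main__":
    name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder base model, not from the paper
    tok = AutoTokenizer.from_pretrained(name)
    lm = AutoModelForCausalLM.from_pretrained(name, device_map="auto")
    print(answer("What is 17 * 24?", "low", lm, tok))
```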

📝 Abstract
Recent Large Reasoning Models (LRMs) achieve strong performance by leveraging long-form Chain-of-Thought (CoT) reasoning, but uniformly applying overlong reasoning at inference time incurs substantial and often unnecessary computational cost. To address this, prior work explores various strategies to infer an appropriate reasoning budget from the input. However, such approaches are unreliable in the worst case, as estimating the minimal required reasoning effort is fundamentally difficult, and they implicitly fix the trade-off between reasoning cost and accuracy during training, limiting flexibility under varying deployment scenarios. Motivated by these limitations, we propose ORBIT, a controllable multi-budget reasoning framework with well-separated reasoning modes triggered by input. ORBIT employs multi-stage reinforcement learning to discover Pareto-optimal reasoning behaviors at each effort, followed by on-policy distillation to fuse these behaviors into a single unified model. Experiments show that ORBIT achieves (1) controllable reasoning behavior over multiple modes, (2) competitive reasoning density within each mode, and (3) integration of these frontier policies into a single unified student model while preserving clear mode separation and high per-mode performance.
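The on-policy distillation step mentioned in the abstract can be sketched roughly as follows: the student samples its own trajectories, and the mode-specific teacher scores them, with the student trained to minimize a per-token reverse KL on those samples. This is a minimal sketch under assumed interfaces (Hugging Face-style causal language models); the paper's actual objective and training loop may differ.

```python
# Minimal sketch (assumptions, not the paper's implementation): one on-policy
# distillation step, where the student learns from the teacher on its own rollouts.
import torch
import torch.nn.functional as F


def on_policy_distill_step(student, teacher, input_ids, max_new_tokens=256):
    """Sample from the student, then match the teacher via reverse KL on the sample."""
    student.eval()
    with torch.no_grad():
        # Student generates its own continuation (on-policy data).
        rollout = student.generate(input_ids, do_sample=True,
                                   max_new_tokens=max_new_tokens)
    student.train()

    # Next-token logits from both models over the sampled sequence.
    s_logits = student(rollout).logits[:, :-1]
    with torch.no_grad():
        t_logits = teacher(rollout).logits[:, :-1]

    # Reverse KL(student || teacher), computed only on the generated tokens.
    gen_start = input_ids.shape[1] - 1
    s_logp = F.log_softmax(s_logits[:, gen_start:], dim=-1)
    t_logp = F.log_softmax(t_logits[:, gen_start:], dim=-1)
    loss = (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()
    return loss
```

In a multi-budget setting, one would presumably pick the teacher corresponding to the reasoning mode signaled by the input trigger before calling a step like this, but that routing detail is an assumption here.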
Problem

Research questions and friction points this paper is trying to address.

reasoning budget
computational cost
Chain-of-Thought
controllable reasoning
multi-budget reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

controllable reasoning
multi-budget reasoning
on-policy distillation
Pareto-optimal policies
reinforcement learning
🔎 Similar Papers
No similar papers found.
Kun Liang
University of Waterloo
Clive Bai
LLM Department, Tencent
Xin Xu
LLM Department, Tencent; The Hong Kong University of Science and Technology
Chenming Tang
School of Computer Science, Peking University; National Key Laboratory for Multimedia Information Processing, Peking University
Sanwoo Lee
Peking University
Natural Language Processing, Deep Learning
Weijie Liu
Nankai University
System Security, Virtualization, Binary Analysis, Image Fusion
Saiyong Yang
LLM Department, Tencent
Yunfang Wu
Peking University
NLP