ORBIT: On-policy Exploration-Exploitation for Controllable Multi-Budget Reasoning

📅 2026-01-13
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing large reasoning models, which typically employ fixed-length chains of thought, leading to computational redundancy and inflexible deployment that hinders dynamic trade-offs between inference cost and accuracy. To overcome this, we propose a controllable multi-budget reasoning framework that activates distinct reasoning modes based on input-specific triggers. Leveraging a multi-stage reinforcement learning approach, the framework identifies Pareto-optimal strategies under varying computational budgets, which are subsequently unified into a single model via on-policy distillation. Our method achieves, for the first time, a flexible decoupling of inference cost and performance, enabling controllable multi-mode reasoning. It maintains clear separation among reasoning modes while efficiently integrating diverse state-of-the-art strategies without compromising overall performance.
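As a rough illustration of the input-triggered mode selection described above, the sketch below prepends a budget-control token to the prompt so a single unified model can be steered toward a low-, medium-, or high-effort reasoning mode. The trigger tokens, token budgets, and base model name are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' code): steering a unified multi-budget
# reasoner with a hypothetical control token per reasoning mode.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODE_TRIGGERS = {            # hypothetical control tokens, one per budget
    "low":    "<budget:low>",
    "medium": "<budget:medium>",
    "high":   "<budget:high>",
}
MODE_MAX_TOKENS = {"low": 512, "medium": 2048, "high": 8192}  # assumed budgets


def answer(question: str, mode: str, model, tokenizer) -> str:
    """Prepend the mode trigger so the model reasons under the chosen budget."""
    prompt = f"{MODE_TRIGGERS[mode]}\n{question}"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=MODE_MAX_TOKENS[mode])
    # Return only the newly generated continuation.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)


if __name__ == "__main__":
    name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder base model, not from the paper
    tok = AutoTokenizer.from_pretrained(name)
    lm = AutoModelForCausalLM.from_pretrained(name, device_map="auto")
    print(answer("What is 17 * 24?", "low", lm, tok))
```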

📝 Abstract
Recent Large Reasoning Models (LRMs) achieve strong performance by leveraging long-form Chain-of-Thought (CoT) reasoning, but uniformly applying overlong reasoning at inference time incurs substantial and often unnecessary computational cost. To address this, prior work explores various strategies to infer an appropriate reasoning budget from the input. However, such approaches are unreliable in the worst case, as estimating the minimal required reasoning effort is fundamentally difficult, and they implicitly fix the trade-off between reasoning cost and accuracy during training, limiting flexibility under varying deployment scenarios. Motivated by these limitations, we propose ORBIT, a controllable multi-budget reasoning framework with well-separated reasoning modes triggered by input. ORBIT employs multi-stage reinforcement learning to discover Pareto-optimal reasoning behaviors at each effort, followed by on-policy distillation to fuse these behaviors into a single unified model. Experiments show that ORBIT achieves (1) controllable reasoning behavior over multiple modes, (2) competitive reasoning density within each mode, and (3) integration of these frontier policies into a single unified student model while preserving clear mode separation and high per-mode performance.
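The on-policy distillation step mentioned in the abstract can be sketched roughly as follows: the student samples its own trajectories, and the mode-specific teacher scores them, with the student trained to minimize a per-token reverse KL on those samples. This is a minimal sketch under assumed interfaces (Hugging Face-style causal language models); the paper's actual objective and training loop may differ.

```python
# Minimal sketch (assumptions, not the paper's implementation): one on-policy
# distillation step, where the student learns from the teacher on its own rollouts.
import torch
import torch.nn.functional as F


def on_policy_distill_step(student, teacher, input_ids, max_new_tokens=256):
    """Sample from the student, then match the teacher via reverse KL on the sample."""
    student.eval()
    with torch.no_grad():
        # Student generates its own continuation (on-policy data).
        rollout = student.generate(input_ids, do_sample=True,
                                   max_new_tokens=max_new_tokens)
    student.train()

    # Next-token logits from both models over the sampled sequence.
    s_logits = student(rollout).logits[:, :-1]
    with torch.no_grad():
        t_logits = teacher(rollout).logits[:, :-1]

    # Reverse KL(student || teacher), computed only on the generated tokens.
    gen_start = input_ids.shape[1] - 1
    s_logp = F.log_softmax(s_logits[:, gen_start:], dim=-1)
    t_logp = F.log_softmax(t_logits[:, gen_start:], dim=-1)
    loss = (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()
    return loss
```

In a multi-budget setting, one would presumably pick the teacher corresponding to the reasoning mode signaled by the input trigger before calling a step like this, but that routing detail is an assumption here.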
Problem

Research questions and friction points this paper is trying to address.

reasoning budget
computational cost
Chain-of-Thought
controllable reasoning
multi-budget reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

controllable reasoning
multi-budget reasoning
on-policy distillation
Pareto-optimal policies
reinforcement learning
🔎 Similar Papers
No similar papers found.
Kun Liang
University of Waterloo
Clive Bai
LLM Department, Tencent
Xin Xu
LLM Department, Tencent; The Hong Kong University of Science and Technology
Chenming Tang
School of Computer Science, Peking University; National Key Laboratory for Multimedia Information Processing, Peking University
Sanwoo Lee
Peking University
Natural Language Processing, Deep Learning
Weijie Liu
Nankai University
System Security, Virtualization, Binary Analysis, Image Fusion
Saiyong Yang
LLM Department, Tencent
Yunfang Wu
Peking University
NLP