🤖 AI Summary
This work addresses the lack of a unified evaluation framework for reasoning-mode switching strategies in hybrid-reasoning large language models, a gap that hinders fair comparison across approaches and training mechanisms. The authors propose HRBench, a standardized benchmark covering three switching paradigms (prompt-based selection, external routing, and speculative execution) and four training protocols (training-free, supervised fine-tuning (SFT), offline reinforcement learning, and online reinforcement learning), which together give twelve controlled settings. Evaluated across six models and five reasoning benchmarks, the study finds that prompt-based strategies often provide favorable trade-offs between token cost and accuracy, routing methods offer more stable cost reduction, and speculative execution tends to improve accuracy at a higher token cost. The preferred strategy also varies with model scale and task domain, and training affects the strategies differently.
📝 Abstract
Hybrid-reasoning large language models (LLMs) expose explicit controls over reasoning effort, allowing users or systems to trade off answer quality against inference cost. However, existing methods for adaptive thinking-mode selection are typically evaluated under different models, datasets, and implementation assumptions, making it difficult to compare their practical behavior. We introduce HRBench, a unified evaluation framework for studying thinking-mode switching in hybrid-reasoning LLMs. HRBench organizes the design space along two axes: three switching strategy families (prompt-based selection, external routing, and speculative execution) and four training regimes (training-free, SFT, offline RL, and online RL), yielding 12 controlled evaluation settings. We evaluate these settings across 6 LLMs, from Qwen3.5-2B to Kimi-K2.5-1.1T, and 5 reasoning benchmarks covering mathematics, science, and code, while reimplementing 12+ representative prior methods within the same pipeline. Our analysis shows that the switching strategies occupy distinct regions of the effectiveness-efficiency trade-off: prompt-based methods often provide favorable token-accuracy trade-offs, routing methods offer more stable cost reduction, and speculative methods tend to improve accuracy at higher token cost. We further find that training affects strategies differently, and that the preferred strategy varies with model scale and task domain. HRBench provides reference implementations and a unified evaluation platform to support more controlled research on efficient reasoning in hybrid-reasoning LLMs. Our data and code are available at https://github.com/usail-hkust/HRBench.
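As a minimal sketch of how the two axes combine into 12 settings, the grid can be enumerated as a Cartesian product. The names `STRATEGIES`, `TRAINING_REGIMES`, and `evaluation_settings` are hypothetical labels chosen for illustration and are not taken from the HRBench codebase.

```python
from itertools import product

# Axis 1: switching strategy families named in the abstract.
STRATEGIES = ("prompt-based selection", "external routing", "speculative execution")

# Axis 2: training regimes named in the abstract.
TRAINING_REGIMES = ("training-free", "SFT", "offline RL", "online RL")


def evaluation_settings():
    """Return every (strategy, training regime) pair in the grid.

    Hypothetical helper for illustration only, not the HRBench API.
    """
    return list(product(STRATEGIES, TRAINING_REGIMES))


settings = evaluation_settings()
print(len(settings))  # 3 strategies x 4 regimes = 12
for strategy, regime in settings:
    print(f"{strategy} + {regime}")
```

Each pair would then be evaluated on every model and benchmark, so this grid is only the first of the dimensions the evaluation spans.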