AI Summary
This work addresses the high latency and computational overhead of large reasoning models (LRMs) in multi-step chain-of-thought reasoning, where existing collaborative approaches struggle to efficiently determine when to invoke the large model. The authors propose a lightweight, training-free collaborative inference framework that leverages a small model to generate the first token of each reasoning step and uses its entropy for dynamic routing: low-entropy steps are completed by the small model, while high-entropy steps trigger a switch to the large model. This approach reveals, for the first time, that the entropy of the initial token effectively predicts reasoning difficulty, enabling intuitive, "eureka-moment"-style decisions without generating full steps or relying on post-hoc verification. Evaluated on benchmarks including AIME25, the method reduces inference latency by 25.9% and improves accuracy by 10.7% compared to using the large model alone, achieving a strong balance between efficiency and performance.
Abstract
Large Reasoning Models (LRMs) achieve remarkable performance by explicitly generating multi-step chains of thought, but this capability incurs substantial inference latency and computational cost. Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models, yet a fundamental challenge remains: determining when a reasoning step requires the capacity of a large model or the efficiency of a small model. Existing routing strategies either rely on local token probabilities or post-hoc verification, introducing significant inference overhead. In this work, we propose a novel perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its very first token. Inspired by the "Aha Moment" phenomenon in LRMs, we show that the entropy of the initial token serves as a strong predictor of step difficulty. Building on this insight, we introduce GlimpRouter, a training-free step-wise collaboration framework. GlimpRouter employs a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when the initial token entropy exceeds a threshold. Experiments on multiple benchmarks demonstrate that our approach significantly reduces inference latency while preserving accuracy. For instance, GlimpRouter attains a substantial 10.7% improvement in accuracy while reducing inference latency by 25.9% compared to a standalone large model on AIME25. These results suggest a simple yet effective mechanism for reasoning: allocating computation based on a glimpse of thought rather than full-step evaluation.
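The routing rule described above (compute the entropy of the small model's first-token distribution, then escalate to the large model only when it exceeds a threshold) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names and the threshold value are hypothetical, and `first_token_logits` stands in for whatever logit vector a small model would produce for the first token of a reasoning step.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over raw logits."""
    m = max(logits)                               # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def route_step(first_token_logits, threshold=1.0):
    """Step-wise routing in the spirit of GlimpRouter: a confident (low-entropy)
    first token keeps the step on the small model; an uncertain (high-entropy)
    first token escalates the step to the large model.

    `threshold` is a tuning knob for illustration, not a value from the paper.
    Returns (choice, entropy).
    """
    h = token_entropy(first_token_logits)
    return ("large", h) if h > threshold else ("small", h)

# A peaked distribution (small model is confident) stays on the small model;
# a near-uniform distribution (small model is unsure) escalates.
print(route_step([10.0, 0.0, 0.0, 0.0])[0])  # confident -> "small"
print(route_step([0.0, 0.0, 0.0, 0.0])[0])   # uncertain -> "large"
```

Because only one token is generated by the small model before the decision, the routing overhead is a single forward step rather than a full candidate solution followed by verification.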