🤖 AI Summary
To address the inefficiency of reasoning-augmented large language models (RLLMs), namely their over-reliance on lengthy reasoning chains even for simple tasks, which leads to excessive token consumption and degraded inference efficiency, this paper proposes a dynamic routing mechanism that assesses a model's problem-solving capability in real time prior to inference and adaptively selects between general-purpose and reasoning-intensive modes. Key contributions include: (1) the first capability-aware embedding derived from hidden-layer representations, coupled with a lightweight pre-decision router; and (2) Gradient-10K, the first densely sampled dataset explicitly designed for modeling fine-grained difficulty boundaries. Evaluated across multiple benchmarks, the method matches the accuracy of full-reasoning baselines while reducing token usage by 30–55%, and it remains robust across diverse model scales and reasoning paradigms, including chain-of-thought (CoT) and tree-of-thought (ToT).
📄 Abstract
While reasoning-augmented large language models (RLLMs) significantly enhance complex-task performance through extended reasoning chains, they inevitably incur substantial unnecessary token consumption, particularly on simpler problems where Short Chain-of-Thought (Short CoT) suffices. This overthinking phenomenon leads to inefficient resource usage without proportional accuracy gains. To address this issue, we propose Self-Route, a dynamic reasoning framework that automatically selects between general and reasoning modes based on an estimate of the model's capability. Our approach introduces a lightweight pre-inference stage that extracts capability-aware embeddings from hidden-layer representations, enabling real-time evaluation of the model's ability to solve a given problem. We further construct Gradient-10K, a dataset with dense complexity sampling built on model-based difficulty estimation, to train the router for precise capability-boundary detection. Extensive experiments demonstrate that Self-Route achieves accuracy comparable to reasoning models while reducing token consumption by 30–55% across diverse benchmarks. The framework remains effective across models of different parameter scales and reasoning paradigms, highlighting its general applicability and practical value.
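The pre-inference routing idea described above can be sketched as follows. This is a minimal illustrative example, not the paper's actual implementation: it assumes the prompt's token-level hidden states are mean-pooled into a capability-aware embedding, scored by a lightweight binary router (here a logistic-regression stand-in with placeholder weights, where the real router would be trained on Gradient-10K difficulty labels), and dispatched to either the general (Short CoT) or reasoning mode. All names, dimensions, and the pooling/threshold choices are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN_DIM = 64  # stand-in for the backbone model's hidden size

# Lightweight router: logistic regression over the pooled hidden state.
# Random placeholder weights; in practice these would be learned from
# difficulty-labeled data such as Gradient-10K.
W = rng.normal(size=HIDDEN_DIM)
b = 0.0

def capability_embedding(hidden_states: np.ndarray) -> np.ndarray:
    """Mean-pool token-level hidden states (T x H) into one embedding."""
    return hidden_states.mean(axis=0)

def route(hidden_states: np.ndarray, threshold: float = 0.5) -> str:
    """Return 'general' if the model is judged capable of solving the
    problem without long reasoning, else 'reasoning'."""
    z = capability_embedding(hidden_states) @ W + b
    p_solvable = 1.0 / (1.0 + np.exp(-z))  # estimated solve probability
    return "general" if p_solvable >= threshold else "reasoning"

# Example: route a prompt represented by 12 token hidden states.
prompt_hidden = rng.normal(size=(12, HIDDEN_DIM))
mode = route(prompt_hidden)
print(mode)
```

Because the router runs once per query on a single pooled vector, its cost is negligible next to decoding; the token savings come from skipping the long reasoning chain whenever the estimated solve probability clears the threshold.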