AI Summary
Large language models (LLMs) achieve higher accuracy on complex reasoning tasks when using techniques such as chain-of-thought (CoT), but these techniques incur substantial latency and token overhead. Method: This paper proposes a semantic routing mechanism that dynamically determines, based on query semantics, whether to activate advanced reasoning modes (e.g., CoT); reasoning is thus conditionally enabled only for queries that require it. Contribution/Results: It introduces fine-grained semantic classification into dynamic routing decisions for the first time and integrates conditional reasoning execution with resource-aware scheduling within the vLLM framework. On MMLU-Pro, the method improves accuracy by 10.2 percentage points over the baseline while reducing average response latency by 47.1% and token consumption by 48.5%. These gains significantly improve the accuracy-efficiency Pareto frontier for open-source LLM serving systems.
Abstract
Large Language Models (LLMs) demonstrate substantial accuracy gains when augmented with reasoning modes such as chain-of-thought and inference-time scaling. However, reasoning also incurs significant costs in inference latency and token usage, with attendant environmental and financial impacts, and is unnecessary for many simple prompts. We present a semantic router that classifies queries based on their reasoning requirements and selectively applies reasoning only when beneficial. Our approach achieves a 10.2 percentage point improvement in accuracy on the MMLU-Pro benchmark while reducing response latency by 47.1% and token consumption by 48.5% compared to direct inference with vLLM. These results demonstrate that semantic routing offers an effective mechanism for striking a balance between accuracy and efficiency in open-source LLM serving systems.
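The core idea of the router can be illustrated with a minimal sketch: classify each query, then enable the reasoning mode only when the classifier predicts a benefit. The paper's classifier is a fine-grained semantic model; the keyword heuristic, cue list, prompt prefix, and token budgets below are hypothetical stand-ins for illustration only.

```python
# Toy sketch of conditional reasoning activation via a semantic router.
# The real system uses a learned semantic classifier and vLLM scheduling;
# everything here is an illustrative assumption.

REASONING_CUES = ("prove", "derive", "why", "step by step", "calculate")

def needs_reasoning(query: str) -> bool:
    """Heuristic stand-in for the semantic classifier: does this
    query likely benefit from chain-of-thought reasoning?"""
    q = query.lower()
    return any(cue in q for cue in REASONING_CUES)

def route(query: str) -> dict:
    """Build an inference request, enabling CoT only when needed."""
    if needs_reasoning(query):
        # Reasoning path: CoT prompt prefix and a larger token budget.
        return {"prompt": "Let's think step by step.\n" + query,
                "max_tokens": 2048, "reasoning": True}
    # Fast path: answer directly with a small token budget.
    return {"prompt": query, "max_tokens": 256, "reasoning": False}
```

A simple query like "What is the capital of France?" takes the fast path, while "Prove that the square root of 2 is irrational" triggers the reasoning path; this conditional execution is what saves latency and tokens on prompts that do not need CoT.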