When to Reason: Semantic Router for vLLM

πŸ“… 2025-10-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Large language models (LLMs) achieve higher accuracy on complex tasks when augmented with advanced reasoning modes such as chain-of-thought (CoT), but these modes incur substantial latency and token overhead. Method: This paper proposes a semantic routing mechanism that dynamically determines, based on query semantics, whether to activate an advanced reasoning mode (e.g., CoT); reasoning is thus conditionally enabled only for queries that require it. Contribution/Results: It introduces fine-grained semantic classification into dynamic routing decisions for the first time and integrates conditional reasoning execution with resource-aware scheduling within the vLLM framework. On MMLU-Pro, the method improves accuracy by 10.2 percentage points over the baseline while reducing average response latency by 47.1% and token consumption by 48.5%. These gains significantly improve the Pareto frontier between accuracy and inference efficiency in open-source LLM serving systems.

πŸ“ Abstract
Large Language Models (LLMs) demonstrate substantial accuracy gains when augmented with reasoning modes such as chain-of-thought and inference-time scaling. However, reasoning also incurs significant costs in inference latency and token usage, with environmental and financial impacts, which are unnecessary for many simple prompts. We present a semantic router that classifies queries based on their reasoning requirements and selectively applies reasoning only when beneficial. Our approach achieves a 10.2 percentage point improvement in accuracy on the MMLU-Pro benchmark while reducing response latency by 47.1% and token consumption by 48.5% compared to direct inference with vLLM. These results demonstrate that semantic routing offers an effective mechanism for striking a balance between accuracy and efficiency in open-source LLM serving systems.
Problem

Research questions and friction points this paper is trying to address.

Selectively applying reasoning to reduce LLM costs
Classifying queries by reasoning needs for efficiency
Balancing accuracy and latency in LLM inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic router classifies query reasoning requirements
Selectively applies reasoning only when beneficial
Reduces latency and token usage while improving accuracy
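The routing idea above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it stands in for the paper's learned semantic classifier with a simple bag-of-words cosine similarity against hand-picked exemplar queries, and the exemplar lists, `needs_reasoning`, and the `enable_thinking` request field are all hypothetical names chosen for this sketch.

```python
# Minimal sketch of a semantic router: classify a query as reasoning-heavy or
# not, then conditionally enable the reasoning mode on the serving request.
# Similarity here is stdlib bag-of-words cosine; the paper uses a learned
# fine-grained semantic classifier instead.
import math
from collections import Counter

# Hypothetical exemplar queries for each route.
REASONING_EXEMPLARS = [
    "prove that the sum of two odd numbers is even",
    "solve this multi-step physics word problem",
    "derive the time complexity of this algorithm",
]
DIRECT_EXEMPLARS = [
    "what is the capital of france",
    "translate hello to spanish",
    "give me a synonym for happy",
]

def _vec(text: str) -> Counter:
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def needs_reasoning(query: str, margin: float = 0.0) -> bool:
    """Route to the reasoning mode if the query is semantically closer
    to the reasoning exemplars than to the direct-answer exemplars."""
    q = _vec(query)
    reason_score = max(_cosine(q, _vec(e)) for e in REASONING_EXEMPLARS)
    direct_score = max(_cosine(q, _vec(e)) for e in DIRECT_EXEMPLARS)
    return reason_score > direct_score + margin

def build_request(query: str) -> dict:
    """Attach the routing decision to a (hypothetical) serving request."""
    return {"prompt": query, "enable_thinking": needs_reasoning(query)}
```

With this shape, a simple factual query routes to direct inference (skipping CoT latency and tokens), while a proof-style query routes to the reasoning mode; the paper's contribution is making this decision with a fine-grained semantic classifier and coupling it to resource-aware scheduling inside vLLM.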
πŸ”Ž Similar Papers
No similar papers found.
Authors

Chen Wang (IBM Research, Yorktown Heights, NY, 10598)
Xunzhuo Liu (Tencent)
Yuhan Liu (University of Chicago)
Yue Zhu (IBM Research)
Xiangxi Mo (UC Berkeley)
Junchen Jiang (University of Chicago)
Huamin Chen (Red Hat, Boston, MA, 02210)