🤖 AI Summary
This work proposes BoundaryRouter, a training-free query routing framework that reduces inference latency and compute cost by deciding, per query, between direct LLM inference and full agent execution. Designed for cold-start conditions, BoundaryRouter builds a compact experience memory by running both systems on a shared seed set, then combines similarity-based case retrieval with rubric-guided reasoning to make routing decisions. The authors also introduce RouteBench, a benchmark that evaluates routing in in-domain, paraphrased, and out-of-domain settings. In the reported experiments, BoundaryRouter reduces inference time by 60.6% compared to full agent execution and improves performance by 28.6% over direct LLM inference, outperforming prompt-based and retrieval-only routing by an average of 37.9% and 8.2%, respectively.
📝 Abstract
LLM agents achieve strong performance on complex reasoning tasks but incur high latency and compute cost. In practice, many queries fall within the capability boundary of cutting-edge LLMs and do not require full agent execution, making effective routing between LLMs and agents a key challenge. We study the problem of routing queries between lightweight LLM inference and full agent execution under realistic cold-start settings. To address this, we propose BoundaryRouter, a training-free routing framework that uses early behavioral experience and rubric-guided reasoning to decide whether to answer a query with direct LLM inference or escalate to an agent. BoundaryRouter builds a compact experience memory by executing both systems on a shared seed set and retrieves similar cases at inference time to guide routing decisions. To evaluate this method, we introduce RouteBench, a benchmark covering in-domain, paraphrased, and out-of-domain routing settings. Experiments show that BoundaryRouter reduces inference time by 60.6% compared to the agent while improving performance by 28.6% over direct LLM inference, outperforming prompt-based and retrieval-only routing by an average of 37.9% and 8.2%, respectively.
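The routing idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Experience`, `build_memory`, and `route`, the bag-of-words cosine similarity, the `k` and `threshold` parameters, and the toy seed queries are all assumptions made for illustration. In particular, a simple similarity-weighted vote stands in for the rubric-guided reasoning step, which the abstract does not specify in detail.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Experience:
    """One seed query with the observed outcome of both systems (hypothetical schema)."""
    query: str
    llm_ok: bool    # direct LLM inference answered the seed query correctly
    agent_ok: bool  # full agent execution answered the seed query correctly


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; a stand-in for a real embedding model."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def build_memory(seed_queries, run_llm, run_agent, is_correct):
    """Execute both systems on a shared seed set and record how each one fared."""
    return [
        Experience(q, is_correct(q, run_llm(q)), is_correct(q, run_agent(q)))
        for q in seed_queries
    ]


def route(query, memory, k=3, threshold=0.5):
    """Return "llm" or "agent" from the k most similar seed cases.

    Assumption: a similarity-weighted vote replaces rubric-guided reasoning.
    """
    scored = sorted(
        ((cosine(query, e.query), e) for e in memory),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in scored)
    if total == 0:
        # No similar experience at all: fall back to the stronger, costlier system.
        return "agent"
    llm_rate = sum(sim for sim, e in scored if e.llm_ok) / total
    return "llm" if llm_rate >= threshold else "agent"


# Toy experience memory with made-up seed queries and outcomes.
memory = [
    Experience("what is the capital of france", True, True),
    Experience("what is the capital of japan", True, True),
    Experience("book a flight and summarize the refund policy", False, True),
    Experience("search the web and compile a report on flight prices", False, True),
]

print(route("what is the capital of italy", memory, k=2))
print(route("compile a report on flight refund policy", memory, k=2))
```

In this sketch, a query with no overlap with any stored case defaults to the agent. That is a conservative design choice that trades latency for accuracy, and is not something the abstract prescribes.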