🤖 AI Summary
This work addresses the challenges of high energy consumption and the inflexibility of static single-model strategies in large language model (LLM) inference, which struggle to balance accuracy and efficiency. The authors propose a context-aware dynamic routing framework that selects the optimal model from a heterogeneous pool based on lightweight query features—such as task type, semantic clustering, and textual complexity—and employs a multi-armed bandit algorithm to optimize the routing policy online. This approach requires no offline calibration, enables seamless integration of new models, and adaptively trades off accuracy against energy consumption under partial feedback. Experiments across five benchmark tasks show a 22% improvement in accuracy and a 31% reduction in cumulative energy usage compared to random routing. On RouterBench, the method achieves an average accuracy of 71.7%, with a peak of 75.7%.
📝 Abstract
Large language models (LLMs) demonstrate remarkable capabilities, but their broad deployment is limited by significant computational resource demands, particularly energy consumption during inference. Static, one-model-fits-all inference strategies are often inefficient, as they do not exploit the diverse range of available models or adapt to varying query requirements. This paper presents GreenServ, a dynamic, context-aware routing framework that optimizes the trade-off between inference accuracy and energy efficiency. GreenServ extracts lightweight contextual features from each query, including task type, semantic cluster, and text complexity, and routes queries to the most suitable model from a heterogeneous pool based on observed accuracy and energy usage. We employ a multi-armed bandit approach to learn adaptive routing policies online. This approach operates under partial feedback, eliminates the need for extensive offline calibration, and streamlines the integration of new models into the inference pipeline. We evaluated GreenServ across five benchmark tasks and a pool of 16 contemporary open-access LLMs. Experimental results show that GreenServ consistently outperforms static (single-model) and random baselines. In particular, compared to random routing, GreenServ achieved a 22% increase in accuracy while reducing cumulative energy consumption by 31%. Finally, we evaluated GreenServ with RouterBench, achieving an average accuracy of 71.7% with a peak accuracy of 75.7%. All artifacts are open-source and available as an anonymous repository for review purposes here: https://anonymous.4open.science/r/llm-inference-router-EBEA/README.md
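To make the routing idea concrete, the sketch below shows a minimal epsilon-greedy contextual bandit router in the spirit described above: one arm per model, statistics keyed by a lightweight query context (here just a task-type label), and a reward that trades accuracy against energy under partial feedback (only the chosen model's outcome is observed). All class, parameter, and model names are illustrative assumptions, not GreenServ's actual API, and the paper's exact bandit algorithm and feature set may differ.

```python
import random
from collections import defaultdict


class BanditRouter:
    """Illustrative epsilon-greedy contextual bandit over a model pool.

    Keeps a running mean reward per (context, model) pair; reward blends
    accuracy with an energy penalty. Names and weights are hypothetical.
    """

    def __init__(self, models, epsilon=0.1, energy_weight=0.5, energy_scale=100.0):
        self.models = list(models)
        self.epsilon = epsilon            # exploration rate
        self.energy_weight = energy_weight  # accuracy/energy trade-off knob
        self.energy_scale = energy_scale    # normalizes energy into ~[0, 1]
        self.means = defaultdict(float)     # (context, model) -> mean reward
        self.counts = defaultdict(int)      # (context, model) -> pull count

    def route(self, context):
        """Pick a model for this query context (explore or exploit)."""
        if random.random() < self.epsilon:
            return random.choice(self.models)
        return max(self.models, key=lambda m: self.means[(context, m)])

    def update(self, context, model, accuracy, energy_joules):
        """Partial feedback: update only the model that actually served."""
        reward = accuracy - self.energy_weight * (energy_joules / self.energy_scale)
        key = (context, model)
        self.counts[key] += 1
        # incremental mean update
        self.means[key] += (reward - self.means[key]) / self.counts[key]


# Toy simulation: on "qa" queries a small model is both more accurate
# and cheaper, so the router should learn to prefer it.
random.seed(0)
router = BanditRouter(["small-llm", "large-llm"], epsilon=0.2)
for _ in range(500):
    m = router.route("qa")
    acc = 0.9 if m == "small-llm" else 0.7
    energy = 10.0 if m == "small-llm" else 80.0
    router.update("qa", m, acc, energy)
print(router.route("qa"))
```

Adding a new model to the pool needs no offline calibration in this scheme: it starts with zero statistics and is tried during exploration steps, mirroring the seamless-integration property the abstract claims.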