🤖 AI Summary
Large language models (LLMs) face deployment bottlenecks in software engineering tasks, including high inference cost, prolonged latency, and performance volatility, while existing static scheduling approaches rely heavily on offline training data and suffer from poor generalizability and adaptability. To address these challenges, we propose SmartLLMs Scheduler, the first framework to integrate adaptive cache management, a multi-feature joint performance-cost prediction model, task-aware dynamic scheduling, and real-time policy updating. Crucially, it operates without pre-collected training data, enabling lightweight, flexible online optimization. Extensive experiments on log parsing and code generation show that SmartLLMs Scheduler achieves an average performance improvement of 198.82% and reduces end-to-end processing time by 63.28% compared with state-of-the-art baselines.
📝 Abstract
Large Language Models (LLMs) such as GPT-4 and Llama have shown remarkable capabilities in a variety of software engineering tasks. Despite these advancements, their practical deployment faces challenges, including high financial costs, long response times, and varying performance, especially when handling a large number of queries (jobs). Existing optimization strategies for deploying LLMs on diverse tasks focus on static scheduling, which requires extensive training data for performance prediction, increasing computational costs and limiting applicability and flexibility. In this paper, we propose the SmartLLMs Scheduler (SLS), a dynamic and cost-effective scheduling solution. The key idea is to learn LLMs' performance on diverse tasks and incorporate their real-time feedback to update scheduling strategies periodically. Specifically, SLS incorporates three key components: an Adaptive Cache Manager, a Performance-Cost Optimized Scheduler, and a Dynamic Update Manager. The Cache Manager stores the outputs of previously processed queries and employs an adaptive strategy to reduce redundant computations and minimize response times. For queries not found in the cache, the Scheduler dynamically allocates each one to the most suitable LLM based on the performance and cost predicted by models that take both query-specific and LLM-specific features as input. The Update Manager continuously refines the cache and scheduling strategies based on real-time feedback from the assigned queries, improving decision-making and adapting to evolving task characteristics. To evaluate the effectiveness of SLS, we conduct extensive experiments on two LLM-based software engineering tasks: log parsing and code generation. The results show that SLS significantly outperforms the baseline methods, achieving an average performance improvement of 198.82% and an average processing time reduction of 63.28%.
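The three-component pipeline described above (cache lookup, performance-cost-aware assignment, feedback-driven update) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the class names, the linear quality-cost scoring formula, and the exponential-moving-average update rule are all hypothetical stand-ins, not the paper's actual prediction models or update policy.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical running estimates per LLM; field names are illustrative.
@dataclass
class LLMStats:
    quality: float  # estimated task performance in [0, 1]
    cost: float     # estimated cost per query

class SmartScheduler:
    """Toy sketch of the SLS loop: cache -> assign -> update."""

    def __init__(self, llms, alpha=0.5, lr=0.2):
        self.llms = llms    # name -> LLMStats
        self.cache = {}     # query key -> cached output
        self.alpha = alpha  # performance vs. cost trade-off weight
        self.lr = lr        # feedback learning rate

    def _key(self, query):
        return hashlib.sha256(query.encode()).hexdigest()

    def schedule(self, query):
        # 1) Adaptive Cache Manager: reuse outputs of identical queries.
        k = self._key(query)
        if k in self.cache:
            return "cache", self.cache[k]
        # 2) Performance-Cost Optimized Scheduler: pick the LLM with the
        #    best predicted quality-cost trade-off (assumed linear score).
        def score(stats):
            return self.alpha * stats.quality - (1 - self.alpha) * stats.cost
        best = max(self.llms, key=lambda name: score(self.llms[name]))
        return best, None

    def update(self, query, llm_name, output, observed_quality):
        # 3) Dynamic Update Manager: refine cache and estimates from feedback.
        self.cache[self._key(query)] = output
        stats = self.llms[llm_name]
        stats.quality += self.lr * (observed_quality - stats.quality)
```

For example, with a cheap low-quality model and an expensive high-quality one, the scheduler first routes a query by predicted score, then serves the identical query from the cache after feedback arrives; the real system replaces the toy score with learned prediction models over query and LLM features.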