AI Summary
This paper addresses cost-effective prompt allocation across multi-model generative AI services. Existing approaches largely ignore inter-model pricing disparities and prioritize performance alone. To bridge this gap, we propose the first cost-aware online learning framework for prompt allocation. Given sequentially arriving user prompts, our method employs a dynamic "low-cost-first, progressive fallback" scheduling strategy: it jointly estimates task difficulty and model response quality in real time, enabling Pareto-optimal model selection via adaptive decision thresholds. The lightweight mechanism ensures low latency while maximizing cost efficiency. Experiments on puzzle solving, code generation, and code translation tasks demonstrate up to a 47% reduction in average service cost, alongside improved response satisfaction and higher system throughput.
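The "low-cost-first, progressive fallback" idea can be illustrated with a minimal sketch. The model list, prices, `score_fn` quality estimator, and fixed `threshold` below are illustrative assumptions, not the paper's actual algorithm (which learns its decision thresholds online):

```python
# Hypothetical sketch of a "low-cost-first, progressive fallback" cascade.
# Model names, prices, and the quality scorer are illustrative stand-ins.

def allocate_prompt(prompt, models, score_fn, threshold=0.8):
    """Query models in ascending price order; stop at the first
    response whose estimated quality meets the threshold."""
    total_cost = 0.0
    response = None
    for model in sorted(models, key=lambda m: m["price"]):
        response = model["query"](prompt)      # call the (stub) model
        total_cost += model["price"]
        if score_fn(prompt, response) >= threshold:
            return response, total_cost        # satisfactory: stop escalating
    return response, total_cost                # exhausted: best effort so far

# Toy usage with stub models and a trivial scorer:
models = [
    {"price": 1.0, "query": lambda p: f"cheap:{p}"},
    {"price": 5.0, "query": lambda p: f"premium:{p}"},
]
score = lambda p, r: 0.9 if r.startswith("premium") else 0.5
resp, cost = allocate_prompt("solve 2+2", models, score)
# Here the cheap model's answer scores below threshold, so the cascade
# escalates once and pays for both queries.
```

In this toy run the cascade first pays for the cheap model, judges its answer unsatisfactory, then falls back to the premium model, so the total cost is the sum of both prices.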
Abstract
The rapid advancement of generative AI models has provided users with numerous options to address their prompts. When selecting a generative AI model for a given prompt, users should consider not only the performance of the chosen model but also its associated service cost. The principle guiding such consideration is to select the least expensive model among the available satisfactory options. However, existing model-selection approaches typically prioritize performance, overlooking pricing differences between models. In this paper, we introduce PromptWise, an online learning framework designed to assign a sequence of prompts to a group of large language models (LLMs) in a cost-effective manner. PromptWise strategically queries cheaper models first, progressing to more expensive options only if the lower-cost models fail to adequately address a given prompt. Through numerical experiments, we demonstrate PromptWise's effectiveness across various tasks, including puzzles of varying complexity and code generation/translation tasks. The results highlight that PromptWise consistently outperforms cost-unaware baseline methods, emphasizing that directly assigning prompts to the most expensive models can lead to higher costs and potentially lower average performance.