🤖 AI Summary
This work proposes a confidence-based, multi-scale dynamic model selection mechanism that reduces the computational and API invocation costs of large language model inference while maintaining high accuracy. By evaluating task-specific confidence at inference time, the system routes high-confidence tasks to smaller, efficient models and reserves larger, more capable models for low-confidence or complex queries. The approach integrates response accuracy prediction with a cascaded multi-model inference strategy. Evaluated on the MMLU benchmark, the system achieves accuracy comparable to that of the largest model while reducing computational cost by 20%–40%. Furthermore, when applied to GPT-4o API calls, it decreases token consumption by approximately 60%, substantially enhancing both inference efficiency and cost-effectiveness.
📝 Abstract
Large Language Models (LLMs) have revolutionized inference across diverse natural language tasks, with larger models performing better but at higher computational cost. We propose a confidence-driven strategy that dynamically selects the most suitable model based on confidence estimates. By assessing a model's confidence in handling a task and the likely accuracy of its response, the system retains tasks that the smaller model is likely to solve correctly and delegates more uncertain or complex cases to a larger model, ensuring reliability while minimizing computation. Specifically, we evaluate a model's likelihood of knowing the correct answer and the probability that its response is accurate. Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20% to 40%. When applied to GPT-4o API calls, it reduces token usage by approximately 60%, further improving cost efficiency. These findings indicate the potential of confidence-based model selection to enhance real-world LLM deployment, particularly in resource-constrained settings such as edge devices and commercial API applications.
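The cascade described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the stand-in models, the `threshold` value, and the use of a softmax over answer-choice logits as the confidence signal are all assumptions for the sake of the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def answer_confidence(logits):
    """Confidence = probability assigned to the top answer choice."""
    return max(softmax(logits))

def cascade(task, small_model, large_model, threshold=0.8):
    """Answer with the small model; escalate when its confidence is low.

    Each model returns (answer, per-choice logits) for an MMLU-style
    multiple-choice question. `threshold` is a hypothetical tuning knob.
    """
    answer, logits = small_model(task)
    if answer_confidence(logits) >= threshold:
        return answer, "small"
    answer, _ = large_model(task)
    return answer, "large"

# Toy stand-in models: a confident small model on "easy" inputs,
# an uncertain one otherwise, and a large fallback model.
def small_model(task):
    if task == "easy":
        return "A", [4.0, 0.5, 0.2, 0.1]   # peaked distribution
    return "B", [1.1, 1.0, 0.9, 0.8]        # nearly flat distribution

def large_model(task):
    return "C", [5.0, 0.1, 0.1, 0.1]
```

With these stand-ins, an "easy" task stays on the small model while a "hard" one is escalated, which is the cost-saving behavior the paper targets: the large model is only invoked when the small model is unsure.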