Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the fundamental trade-off between performance and efficiency in large language model (LLM) inference. We propose Avengers-Pro, a dynamic routing framework that leverages query embeddings and clustering to orchestrate heterogeneous LLMs—varying in capacity and computational efficiency—to jointly optimize the accuracy-cost Pareto frontier online. Its core contribution is the first unified routing paradigm enabling arbitrary accuracy-efficiency trade-offs, and the first demonstration of multi-model ensembles strictly dominating the best single model in both accuracy and cost. Evaluated across six benchmarks, Avengers-Pro achieves an average 7% higher accuracy than the strongest single model (GPT-5-medium), or matches its accuracy at 27% lower inference cost; under extreme compression (63% cost reduction), it retains 90% of that model's performance.

📝 Abstract
Balancing performance and efficiency is a central challenge in large language model (LLM) advancement. GPT-5 addresses this with test-time routing, dynamically assigning queries to either an efficient or a high-capacity model during inference. In this work, we present Avengers-Pro, a test-time routing framework that ensembles LLMs of varying capacities and efficiencies, providing a unified solution for all performance-efficiency trade-offs. Avengers-Pro embeds and clusters incoming queries, then routes each to the most suitable model based on a performance-efficiency score. Across six challenging benchmarks and eight leading models, including GPT-5-medium, Gemini-2.5-pro, and Claude-opus-4.1, Avengers-Pro achieves state-of-the-art results: by varying a performance-efficiency trade-off parameter, it can surpass the strongest single model (GPT-5-medium) by +7% in average accuracy. Moreover, it can match the average accuracy of the strongest single model at 27% lower cost, and reach ~90% of that performance at 63% lower cost. Finally, it achieves a Pareto frontier, consistently yielding the highest accuracy for any given cost and the lowest cost for any given accuracy among all single models. Code is available at https://github.com/ZhangYiqun018/AvengersPro.
Problem

Research questions and friction points this paper is trying to address.

Optimizing performance-efficiency tradeoffs in LLMs
Dynamic query routing to suitable model capacities
Reducing costs while maintaining high accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic query routing to optimize efficiency
Embedding and clustering for model selection
Performance-efficiency score for cost-effective routing
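The routing idea described above (embed each query, assign it to its nearest cluster, then pick the model with the best accuracy-vs-cost score for that cluster) can be sketched as follows. This is an illustrative reconstruction, not the authors' released implementation: the function names, the per-cluster `accuracy`/`cost` tables, and the exact score formula `alpha * accuracy - (1 - alpha) * normalized_cost` are assumptions for the sake of the example.

```python
# Hedged sketch of cluster-based performance-efficiency routing.
# Assumes per-cluster accuracy and cost statistics for each candidate
# model have been estimated offline on a held-out set; all names here
# (route, alpha, nearest_cluster) are illustrative.
import math


def nearest_cluster(embedding, centroids):
    """Index of the centroid closest (Euclidean) to the query embedding."""
    return min(range(len(centroids)),
               key=lambda c: math.dist(embedding, centroids[c]))


def route(embedding, centroids, accuracy, cost, alpha=0.5):
    """Pick the model maximizing alpha*accuracy - (1-alpha)*normalized cost
    within the query's cluster. alpha -> 1 favors accuracy; alpha -> 0
    favors cheapness (the trade-off parameter from the abstract)."""
    c = nearest_cluster(embedding, centroids)
    max_cost = max(cost[c].values()) or 1.0  # normalize costs to [0, 1]
    return max(accuracy[c],
               key=lambda m: alpha * accuracy[c][m]
                             - (1 - alpha) * cost[c][m] / max_cost)


# Toy example: in cluster 0 the accuracy gap is large, so a high alpha
# selects the big model while a low alpha selects the cheap one.
centroids = [(0.0, 0.0), (1.0, 1.0)]
accuracy = [{"big": 0.90, "small": 0.70}, {"big": 0.80, "small": 0.78}]
cost = [{"big": 1.00, "small": 0.10}, {"big": 1.00, "small": 0.10}]
```

Sweeping `alpha` from 0 to 1 traces out the accuracy-cost frontier the paper reports: each setting yields a different routing policy, from cheapest-acceptable to most-accurate.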
Yiqun Zhang
Shanghai Artificial Intelligence Laboratory
Hao Li
Shanghai Artificial Intelligence Laboratory
Jianhao Chen
Shanghai Artificial Intelligence Laboratory
Hangfan Zhang
PhD student, Pennsylvania State University
Peng Ye
Shanghai Artificial Intelligence Laboratory
Lei Bai
Shanghai AI Laboratory
Foundation Model · Science Intelligence · Multi-Agent System · Autonomous Discovery
Shuyue Hu
Shanghai Artificial Intelligence Lab
multi-agent system · large language model · game theory