Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the fundamental trade-off between performance and efficiency in large language model (LLM) inference. We propose Avengers-Pro, a dynamic routing framework that leverages query embeddings and clustering to orchestrate heterogeneous LLMs—varying in capacity and computational efficiency—to jointly optimize the accuracy-cost Pareto frontier online. Its core contribution is the first unified routing paradigm enabling arbitrary accuracy-efficiency trade-offs, and the first demonstration of multi-model ensembles strictly dominating the best single model in both accuracy and cost. Evaluated across six benchmarks, Avengers-Pro achieves an average 7% higher accuracy than the strongest single model (GPT-5-medium), or matches its accuracy at 27% lower inference cost; under extreme compression (63% cost reduction), it retains 90% of that model's performance.

📝 Abstract
Balancing performance and efficiency is a central challenge in large language model (LLM) advancement. GPT-5 addresses this with test-time routing, dynamically assigning queries to either an efficient or a high-capacity model during inference. In this work, we present Avengers-Pro, a test-time routing framework that ensembles LLMs of varying capacities and efficiencies, providing a unified solution for all performance-efficiency trade-offs. Avengers-Pro embeds and clusters incoming queries, then routes each to the most suitable model based on a performance-efficiency score. Across six challenging benchmarks and eight leading models, including GPT-5-medium, Gemini-2.5-pro, and Claude-opus-4.1, Avengers-Pro achieves state-of-the-art results: by varying a performance-efficiency trade-off parameter, it can surpass the strongest single model (GPT-5-medium) by +7% in average accuracy. Moreover, it can match the average accuracy of the strongest single model at 27% lower cost, and reach ~90% of that performance at 63% lower cost. Finally, it achieves a Pareto frontier, consistently yielding the highest accuracy for any given cost and the lowest cost for any given accuracy among all single models. Code is available at https://github.com/ZhangYiqun018/AvengersPro.
Problem

Research questions and friction points this paper is trying to address.

Optimizing performance-efficiency tradeoffs in LLMs
Dynamic query routing to suitable model capacities
Reducing costs while maintaining high accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic query routing to optimize efficiency
Embedding and clustering for model selection
Performance-efficiency score for cost-effective routing
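The routing idea described above (embed each query, assign it to its nearest cluster, then pick the model with the best accuracy-vs-cost score for that cluster) can be sketched as follows. This is an illustrative reconstruction, not the authors' released implementation: the function names, the per-cluster `accuracy`/`cost` tables, and the exact score formula `alpha * accuracy - (1 - alpha) * normalized_cost` are assumptions for the sake of the example.

```python
# Hedged sketch of cluster-based performance-efficiency routing.
# Assumes per-cluster accuracy and cost statistics for each candidate
# model have been estimated offline on a held-out set; all names here
# (route, alpha, nearest_cluster) are illustrative.
import math


def nearest_cluster(embedding, centroids):
    """Index of the centroid closest (Euclidean) to the query embedding."""
    return min(range(len(centroids)),
               key=lambda c: math.dist(embedding, centroids[c]))


def route(embedding, centroids, accuracy, cost, alpha=0.5):
    """Pick the model maximizing alpha*accuracy - (1-alpha)*normalized cost
    within the query's cluster. alpha -> 1 favors accuracy; alpha -> 0
    favors cheapness (the trade-off parameter from the abstract)."""
    c = nearest_cluster(embedding, centroids)
    max_cost = max(cost[c].values()) or 1.0  # normalize costs to [0, 1]
    return max(accuracy[c],
               key=lambda m: alpha * accuracy[c][m]
                             - (1 - alpha) * cost[c][m] / max_cost)


# Toy example: in cluster 0 the accuracy gap is large, so a high alpha
# selects the big model while a low alpha selects the cheap one.
centroids = [(0.0, 0.0), (1.0, 1.0)]
accuracy = [{"big": 0.90, "small": 0.70}, {"big": 0.80, "small": 0.78}]
cost = [{"big": 1.00, "small": 0.10}, {"big": 1.00, "small": 0.10}]
```

Sweeping `alpha` from 0 to 1 traces out the accuracy-cost frontier the paper reports: each setting yields a different routing policy, from cheapest-acceptable to most-accurate.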
Yiqun Zhang
Shanghai Artificial Intelligence Laboratory
Hao Li
Shanghai Artificial Intelligence Laboratory
Jianhao Chen
Shanghai Artificial Intelligence Laboratory
Hangfan Zhang
PhD student, Pennsylvania State University
Peng Ye
Shanghai Artificial Intelligence Laboratory
Lei Bai
Shanghai AI Laboratory
Foundation Model · Science Intelligence · Multi-Agent System · Autonomous Discovery
Shuyue Hu
Shanghai Artificial Intelligence Lab
multi-agent system · large language model · game theory