Universal Model Routing for Efficient LLM Inference

📅 2025-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem of dynamically routing inference requests at test time to *unseen large language models (LLMs)* to reduce computational cost. Methodologically, it represents each LLM as a feature vector derived from its predictions on a small set of representative prompts, builds a generalizable model representation space via clustering, and pairs a learnable cluster-mapping mechanism with statistical risk modeling, enabling zero-shot routing among more than 30 LLMs excluded from training. Theoretically, it shows that the proposed strategies estimate a theoretically optimal routing rule and provides an excess risk bound quantifying their error. Empirically, the framework substantially reduces inference cost across multiple public benchmarks while achieving routing accuracy close to the optimum, demonstrating strong generalization and reliability.
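The core idea above, representing each LLM as a feature vector of its outcomes on representative prompts, clustering those vectors, and then assigning an unseen LLM to its nearest cluster at test time, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the toy 0/1 correctness features, and the deterministic farthest-point k-means initialization are all assumptions made for the sketch.

```python
import numpy as np

def llm_feature_vector(correct):
    """Represent an LLM by its 0/1 correctness on K representative prompts."""
    return np.asarray(correct, dtype=float)

def kmeans(X, k, iters=20):
    """Tiny k-means over LLM feature vectors.
    Deterministic farthest-point initialization avoids degenerate seeds."""
    centroids = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(-1) for c in centroids], axis=0)
        centroids.append(X[int(np.argmax(d))])
    centroids = np.array(centroids, dtype=float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(0)
    return centroids, labels

def assign_cluster(centroids, v):
    """Zero-shot step: place an unseen LLM into its nearest cluster,
    so it inherits that cluster's routing statistics."""
    return int(np.argmin(((centroids - v) ** 2).sum(-1)))
```

For example, eight training LLMs whose correctness vectors split into two patterns yield two clusters; a new, unseen LLM whose answers on the representative prompts resemble the first pattern is assigned to that pattern's cluster without any retraining of the router.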

📝 Abstract
Large language models' significant advances in capabilities are accompanied by significant increases in inference costs. Model routing is a simple technique for reducing inference cost, wherein one maintains a pool of candidate LLMs, and learns to route each prompt to the smallest feasible LLM. Existing works focus on learning a router for a fixed pool of LLMs. In this paper, we consider the problem of dynamic routing, where new, previously unobserved LLMs are available at test time. We propose a new approach to this problem that relies on representing each LLM as a feature vector, derived from its predictions on a set of representative prompts. Based on this, we detail two effective strategies, relying on cluster-based routing and a learned cluster map respectively. We prove that these strategies are estimates of a theoretically optimal routing rule, and provide an excess risk bound to quantify their errors. Experiments on a range of public benchmarks show the effectiveness of the proposed strategies in routing amongst more than 30 unseen LLMs.
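The abstract's notion of routing each prompt to "the smallest feasible LLM" can be sketched as a simple cost-aware selection rule. This is a hypothetical illustration of that selection step only, not the paper's router: the function name, the per-model predicted-accuracy inputs, the threshold, and the fallback rule are all assumptions.

```python
import numpy as np

def route_prompt(pred_acc, costs, threshold=0.8):
    """Route to the cheapest model whose predicted accuracy on this
    prompt meets the threshold; if none qualifies, fall back to the
    most accurate model in the pool."""
    order = np.argsort(costs)            # candidates, cheapest first
    for i in order:
        if pred_acc[i] >= threshold:
            return int(i)                # smallest feasible LLM
    return int(np.argmax(pred_acc))      # no model is feasible
```

With three models whose predicted accuracies are (0.95, 0.85, 0.60) at costs (10, 3, 1), the rule skips the cheapest model (too inaccurate) and picks the mid-cost one, only escalating to the most expensive model when the accuracy bar cannot be met.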
Problem

Research questions and friction points this paper is trying to address.

Dynamic routing for unseen LLMs
Efficient inference cost reduction
Cluster-based routing strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic routing for LLMs
Feature vector representation
Cluster-based routing strategies