One Head, Many Models: Cross-Attention Routing for Cost-Aware LLM Selection

📅 2025-09-11
🤖 AI Summary
To address the challenge of dynamically selecting the optimal large language model (LLM) for each query at deployment time, balancing response quality against computational cost, this paper proposes a lightweight unified routing framework. Methodologically, it introduces a single-head cross-attention mechanism to jointly model fine-grained query-model interactions and designs an exponential reward function that explicitly encodes user-specified quality-cost trade-offs. The framework integrates joint query-model embedding learning, dual-objective prediction (quality and cost), and end-to-end routing decisions. Evaluated on the large-scale RouterBench benchmark, it achieves up to a 6.6% improvement in Average Improvement in Quality (AIQ) and up to 2.9% higher peak performance than state-of-the-art routers. Moreover, it demonstrates strong cross-domain generalization and incurs minimal inference overhead.

📝 Abstract
The proliferation of large language models (LLMs) with varying computational costs and performance profiles presents a critical challenge for scalable, cost-effective deployment in real-world applications. We introduce a unified routing framework that leverages a single-head cross-attention mechanism to jointly model query and model embeddings, enabling dynamic selection of the optimal LLM for each input query. Our approach is evaluated on RouterBench, a large-scale, publicly available benchmark encompassing diverse LLM pools and domains. By explicitly capturing fine-grained query-model interactions, our router predicts both response quality and generation cost, achieving up to 6.6% improvement in Average Improvement in Quality (AIQ) and 2.9% in maximum performance over existing routers. To robustly balance performance and cost, we propose an exponential reward function that enhances stability across user preferences. The resulting architecture is lightweight, generalizes effectively across domains, and demonstrates improved efficiency compared to prior methods, establishing a new standard for cost-aware LLM routing.
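The abstract describes a single-head cross-attention router that fuses a query embedding with a pool of per-model embeddings to predict both response quality and generation cost. The paper does not publish its architecture in detail here, so the sketch below is a minimal, hypothetical rendering of that idea: the function name, the linear prediction heads `w_quality` and `w_cost`, and the fusion scheme are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

def predict_quality_cost(query_emb, model_embs, w_quality, w_cost):
    """Hypothetical sketch: for one query, predict response quality and
    generation cost for every model in the pool via single-head
    cross-attention between the query and the model embeddings.

    query_emb:  (d,) embedding of the input query
    model_embs: (n_models, d) learned embedding per candidate LLM
    w_quality, w_cost: (d,) assumed linear prediction heads
    """
    d = query_emb.shape[0]
    # scaled dot-product attention: the query attends over the model pool
    attn = softmax(model_embs @ query_emb / np.sqrt(d))   # (n_models,)
    # fine-grained query-model interaction features
    fused = attn[:, None] * model_embs                    # (n_models, d)
    quality = fused @ w_quality                           # per-model quality
    cost = fused @ w_cost                                 # per-model cost
    return quality, cost
```

Because there is only one attention head and the heads are linear, the router adds a single matrix-vector product per model at inference time, consistent with the paper's claim of minimal overhead.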
Problem

Research questions and friction points this paper is trying to address.

Dynamic selection of optimal LLM per query
Balancing performance and computational cost trade-offs
Modeling fine-grained query-model interactions for routing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-head cross-attention routing mechanism
Dynamic LLM selection per query
Exponential reward function balancing cost-performance
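The exponential reward in the bullets above can be sketched in one line. The exact functional form is not given on this page, so the version below, quality discounted by an exponential in cost with a user-set sensitivity `lam`, is an assumed form for illustration only.

```python
import math

def exp_reward(quality, cost, lam):
    # Hypothetical exponential reward: predicted quality discounted
    # exponentially by predicted cost. lam encodes the user's
    # cost sensitivity (lam = 0 ignores cost entirely).
    return quality * math.exp(-lam * cost)
```

A larger `lam` pushes the router toward cheaper models; the exponential keeps the reward smooth in cost, which plausibly underlies the stability across user preferences that the abstract reports.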
👥 Authors
Roshini Pulishetty (University of Massachusetts, Amherst)
Mani Kishan Ghantasala (University of Massachusetts, Amherst)
Keerthy Kaushik Dasoju (University of Massachusetts, Amherst)
Niti Mangwani (University of Massachusetts, Amherst)
Vishal Garimella (University of Massachusetts, Amherst)
Aditya Mate (Microsoft New England)
Somya Chatterjee (Microsoft)
Yue Kang (Microsoft)
Ehi Nosakhare (Microsoft)
Sadid Hasan (Microsoft)
Soundar Srinivasan (Microsoft)