Cost-Aware Contrastive Routing for LLMs

📅 2025-08-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses cost-aware routing for large language models (LLMs) in dynamic, heterogeneous model pools. Existing approaches often neglect prompt context, rely on expensive model-based analysis, assume static expert assignments, or resort to inefficient trial-and-error. The authors propose CSCR (Cost-Spectrum Contrastive Routing), a framework that applies adaptive contrastive learning within cost bands, enabling zero-retraining dynamic model expansion and microsecond-scale inference. CSCR builds a unified embedding space from logit footprints (for open-source models) and perplexity fingerprints (for black-box APIs), and pairs a contrastive encoder with a FAISS index for efficient nearest-neighbor routing. Evaluated across multiple benchmarks, CSCR improves the accuracy–cost trade-off by up to 25% over state-of-the-art methods while generalizing robustly to unseen models and out-of-distribution prompts.

📝 Abstract
We study cost-aware routing for large language models across diverse and dynamic pools of models. Existing approaches often overlook prompt-specific context, rely on expensive model profiling, assume a fixed set of experts, or use inefficient trial-and-error strategies. We introduce Cost-Spectrum Contrastive Routing (CSCR), a lightweight framework that maps both prompts and models into a shared embedding space to enable fast, cost-sensitive selection. CSCR uses compact, fast-to-compute logit footprints for open-source models and perplexity fingerprints for black-box APIs. A contrastive encoder is trained to favor the cheapest accurate expert within adaptive cost bands. At inference time, routing reduces to a single k-NN lookup via a FAISS index, requiring no retraining when the expert pool changes and enabling microsecond latency. Across multiple benchmarks, CSCR consistently outperforms baselines, improving the accuracy-cost tradeoff by up to 25%, while generalizing robustly to unseen LLMs and out-of-distribution prompts.
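The abstract describes inference-time routing as a single k-NN lookup over an embedding index, followed by picking the cheapest adequate expert. A minimal sketch of that lookup, using brute-force NumPy cosine similarity in place of a FAISS index (the expert embeddings, costs, and dimensions are hypothetical toy values, not from the paper):

```python
import numpy as np

# Hypothetical toy pool: 4 experts with unit-norm embeddings and per-call costs.
rng = np.random.default_rng(0)
expert_embs = rng.normal(size=(4, 8)).astype(np.float32)
expert_embs /= np.linalg.norm(expert_embs, axis=1, keepdims=True)
expert_costs = np.array([0.1, 0.5, 1.0, 2.0])

def route(prompt_emb, k=3):
    """Return the cheapest expert among the k nearest in embedding space."""
    prompt_emb = prompt_emb / np.linalg.norm(prompt_emb)
    sims = expert_embs @ prompt_emb              # cosine similarity to each expert
    topk = np.argsort(-sims)[:k]                 # k nearest experts
    return int(topk[np.argmin(expert_costs[topk])])  # cheapest of those

chosen = route(rng.normal(size=8).astype(np.float32))
```

Because the index stores one embedding per expert, adding or removing a model only means adding or removing an index entry, which is consistent with the zero-retraining pool updates the abstract claims.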
Problem

Research questions and friction points this paper is trying to address.

Cost-aware routing for diverse LLM pools
Lightweight prompt-model embedding for fast selection
Optimizing accuracy-cost tradeoff in dynamic expert sets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight shared embedding space for routing
Contrastive encoder favors cheapest accurate expert
Single k-NN lookup with FAISS index
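The second bullet says the contrastive encoder is trained to favor the cheapest accurate expert within an adaptive cost band. The core of that setup is choosing the contrastive positive for each training prompt; a minimal sketch under assumed toy labels (the costs, correctness flags, and band edges are hypothetical, not taken from the paper):

```python
import numpy as np

# Hypothetical per-prompt labels: correct[i] = 1 if expert i answered correctly.
costs   = np.array([0.1, 0.5, 1.0, 2.0])
correct = np.array([0,   1,   1,   1])

def positive_expert(costs, correct, band=(0.2, 1.5)):
    """Pick the cheapest accurate expert inside the cost band (the
    contrastive positive); the remaining experts act as negatives."""
    lo, hi = band
    mask = (correct == 1) & (costs >= lo) & (costs <= hi)
    if not mask.any():
        return None          # no usable positive for this prompt
    idx = np.flatnonzero(mask)
    return int(idx[np.argmin(costs[idx])])

pos = positive_expert(costs, correct)  # → 1 (cheapest accurate expert in band)
```

A standard contrastive loss (e.g. InfoNCE) would then pull the prompt embedding toward this positive expert's embedding and push it away from the negatives; the banding keeps very cheap-but-wrong and very expensive experts from dominating the positives.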