🤖 AI Summary
This work addresses the pervasive model lock-in problem in large language model (LLM) ecosystems, where integrating new models incurs substantial retraining costs. To overcome this, the authors propose ZeroRouter, a novel framework that introduces a model-agnostic universal latent space. By employing a context-aware predictor to map queries into this shared space, ZeroRouter decouples query characteristics from individual model performance, enabling zero-shot integration of new models without retraining. The framework further incorporates a dual-mode optimizer that dynamically balances accuracy, inference cost, and latency according to deployment requirements. Extensive experiments demonstrate that ZeroRouter significantly outperforms existing routing methods across multiple benchmarks, achieving higher routing accuracy at reduced computational cost and lower latency.
📝 Abstract
The rapid proliferation of Large Language Models (LLMs) has led to a fragmented and inefficient ecosystem, a state of "model lock-in" where seamlessly integrating novel models remains a significant bottleneck. Current routing frameworks require exhaustive, costly retraining, hindering scalability and adaptability. We introduce ZeroRouter, a new paradigm for LLM routing that breaks this lock-in. Our approach is founded on a universal latent space, a model-agnostic representation of query difficulty that fundamentally decouples the characterization of a query from the profiling of a model. This allows for zero-shot onboarding of new models without full-scale retraining. ZeroRouter features a context-aware predictor that maps queries to this universal space and a dual-mode optimizer that balances accuracy, cost, and latency. Our framework consistently outperforms all baselines, delivering higher accuracy at lower cost and latency.
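To make the routing idea concrete, here is a minimal sketch of what routing through a shared latent space could look like. All names (`ModelProfile`, `predicted_accuracy`, `route`) and the IRT-style accuracy model are illustrative assumptions, not ZeroRouter's actual predictor or API; the point is only that each model is reduced to a profile vector in the shared space, so onboarding a new model means adding a profile rather than retraining the router.

```python
# Hypothetical sketch: zero-shot routing over a shared latent space.
# The accuracy model (sigmoid of skill minus difficulty, IRT-style) and
# all weights are placeholder assumptions, not the paper's method.
import math
from dataclasses import dataclass
from typing import List

@dataclass
class ModelProfile:
    name: str
    skill: List[float]     # the model's position in the shared latent space
    cost: float            # relative inference cost (assumed units)
    latency: float         # relative latency (assumed units)

def predicted_accuracy(difficulty: List[float], p: ModelProfile) -> float:
    # Toy success probability: sigmoid of the mean skill-minus-difficulty
    # margin across latent dimensions (scale factor 4 is arbitrary).
    margin = sum(s - d for s, d in zip(p.skill, difficulty)) / len(difficulty)
    return 1.0 / (1.0 + math.exp(-4.0 * margin))

def route(difficulty: List[float], profiles: List[ModelProfile],
          cost_weight: float = 0.5, latency_weight: float = 0.1) -> ModelProfile:
    # Dual-mode-style objective: expected accuracy minus weighted cost
    # and latency penalties; weights select the deployment trade-off.
    def utility(p: ModelProfile) -> float:
        return (predicted_accuracy(difficulty, p)
                - cost_weight * p.cost
                - latency_weight * p.latency)
    return max(profiles, key=utility)

# Onboarding a new model is just appending a profile -- no retraining.
small = ModelProfile("small", [0.3, 0.3], cost=0.01, latency=0.2)
large = ModelProfile("large", [0.9, 0.9], cost=0.5, latency=1.5)
```

Under these toy numbers, an easy query (low difficulty) routes to the cheap model because both succeed and cost dominates, while a hard query routes to the strong model because its accuracy margin outweighs the extra cost.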