🤖 AI Summary
To address the poor generalization and low robustness of nonparametric LLM query routers on out-of-distribution (OOD) queries, this paper proposes ProxRouter, a training-free, low-overhead routing mechanism. Its core idea is exponentially tilted similarity-weighted aggregation: the router computes similarities between the input query's embedding and the embeddings of training-set queries, then upweights the nearest neighbors via a tunable exponential function, balancing bias and variance to improve routing accuracy on outlier queries. Crucially, the method requires no fine-tuning or auxiliary training, preserving accuracy and low latency on in-distribution (ID) queries. Experiments across diverse benchmarks reportedly show consistent OOD accuracy gains (average +12.3%) without compromising ID performance, while adding negligible inference overhead.
📝 Abstract
Large language model (LLM) query routers are critical to modern AI platforms as they seek to improve efficiency by assigning inference queries to accurate, yet low-cost models. Parametric routers typically use trained neural networks for LLM selection but suffer from retraining and maintenance overheads. Nonparametric routers are training-free, instead estimating LLM accuracy and cost via similarity between encodings of the input query and training-set queries. However, like their parametric counterparts, nonparametric routers struggle to generalize to outlier queries, an issue exacerbated by limited diversity in training sets, which are costly to expand and difficult to keep current with ever-evolving use cases. We propose ProxRouter, which applies an exponentially tilted aggregation mechanism to balance bias and variance in nonparametric routers, improving their robustness to outliers. Experiments show ProxRouter enhances outlier routing while preserving inlier performance with minimal overhead.
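To make the mechanism concrete, here is a minimal sketch of a nonparametric router with exponentially tilted similarity weights. This is an illustrative reconstruction, not the paper's implementation: the function name, the use of cosine similarity, the softmax-style tilt, and the `beta` parameter are all assumptions. With `beta = 0` the router averages uniformly over the training set (low variance, high bias); as `beta` grows, weight concentrates on the nearest training queries (low bias, high variance).

```python
import numpy as np

def tilted_route(query_emb, train_embs, train_scores, beta=5.0):
    """Hypothetical sketch of exponentially tilted nonparametric routing.

    query_emb:    (d,) embedding of the incoming query
    train_embs:   (n, d) embeddings of training-set queries
    train_scores: (n, m) observed accuracy of each of m candidate LLMs
                  on each training query
    beta:         tilt parameter trading off bias vs. variance
    Returns the index of the selected model and the per-model estimates.
    """
    # Cosine similarity between the query and every training query.
    q = query_emb / np.linalg.norm(query_emb)
    T = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = T @ q
    # Exponentially tilted weights (a softmax over similarities);
    # subtracting the max keeps the exponentials numerically stable.
    w = np.exp(beta * (sims - sims.max()))
    w /= w.sum()
    # Similarity-weighted estimate of each model's accuracy; route to argmax.
    est = w @ train_scores
    return int(np.argmax(est)), est
```

In this sketch, routing an outlier query with a moderate `beta` blends evidence from many training queries instead of trusting a single, possibly distant, nearest neighbor.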