AI Summary
Selecting the optimal model for a new task remains challenging amidst the proliferation of open-source models and untested datasets. This work proposes ModelLens, the first framework to directly learn a model capability map from large-scale, heterogeneous evaluation records. By constructing a performance-aware latent space grounded in model–dataset–metric triplets, ModelLens enables zero-shot ranking and recommendation of unseen models on unseen datasets. Evaluated on a new benchmark comprising 1.62 million evaluation records, ModelLens substantially outperforms existing approaches. Its top-K recommended model pools boost the performance of diverse routing strategies by up to 81% on question-answering tasks and demonstrate strong generalization across both text and vision-language tasks.
Abstract
The open-source model ecosystem now contains hundreds of thousands of pretrained models, yet picking the best model for a new dataset is increasingly infeasible: new models and unbenchmarked datasets emerge continuously, leaving practitioners with no prior records on either side. Existing approaches handle only fragments of this in-the-wild setting: AutoML and transferability estimation select models from small predefined pools or require expensive per-model forward passes on the target dataset, while model routing presupposes a given candidate pool. We introduce ModelLens, a unified framework for model recommendation in the wild. Our key insight is that public leaderboard interactions, though scattered and noisy, collectively trace out an implicit atlas of model capabilities across heterogeneous evaluation settings, a signal rich enough to learn from directly. By learning a performance-aware latent space over model–dataset–metric tuples, ModelLens ranks unseen models on unseen datasets without running candidates on the target dataset. On a new benchmark of 1.62M evaluation records spanning 47K models and 9.6K datasets, ModelLens surpasses baselines that either rely on metadata alone or require running each candidate on the target dataset. Its recommended Top-K pools further improve multiple representative routing methods by up to 81% across diverse QA benchmarks. Case studies on recently released benchmarks further confirm generalization to both text and vision-language tasks.