🤖 AI Summary
Current transferability estimation benchmarks suffer from fundamental flaws, namely unrealistic fixed model spaces and static performance hierarchies, that severely distort evaluation outcomes: simple, dataset-agnostic heuristics frequently outperform sophisticated metrics, exposing a critical mismatch between benchmark protocols and real-world model selection.
Method: The authors conduct a systematic empirical re-evaluation of mainstream transferability metrics across diverse, realistic model spaces and dynamically varying performance rankings.
Contribution/Results: They identify and characterize the primary sources of benchmark bias, showing that unrealistic model spaces and static performance hierarchies inflate the apparent performance of existing metrics. Crucially, they demonstrate that the prevailing evaluation paradigm is unreliable and provide concrete recommendations for constructing more robust and realistic benchmarks to guide future research.
📝 Abstract
Transferability estimation metrics are used to find a high-performing pre-trained model for a given target task without fine-tuning models and without access to the source dataset. Despite the growing interest in developing such metrics, the benchmarks used to measure their progress have gone largely unexamined. In this work, we empirically show the shortcomings of widely used benchmark setups to evaluate transferability estimation metrics. We argue that the benchmarks on which these metrics are evaluated are fundamentally flawed. We empirically demonstrate that their unrealistic model spaces and static performance hierarchies artificially inflate the perceived performance of existing metrics, to the point where simple, dataset-agnostic heuristics can outperform sophisticated methods. Our analysis reveals a critical disconnect between current evaluation protocols and the complexities of real-world model selection. To address this, we provide concrete recommendations for constructing more robust and realistic benchmarks to guide future research in a more meaningful direction.
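The protocol these benchmarks follow, and which the paper argues breaks down under fixed model spaces, can be sketched as follows: score every pre-trained model with a transferability metric using only the target data, then measure how well that ranking correlates (typically via weighted Kendall's tau) with the ranking obtained by actually fine-tuning each model. The sketch below is illustrative only; the model names, metric scores, ImageNet accuracies, and fine-tuned accuracies are hypothetical placeholders, not results from the paper.

```python
# Minimal sketch of the standard benchmark protocol; all numbers and model
# names are hypothetical, not results from the paper.
from scipy import stats

# Ground-truth target-task accuracies obtained by fine-tuning every model.
finetuned_acc = {
    "resnet50": 0.81,
    "densenet121": 0.79,
    "mobilenet_v2": 0.74,
    "vit_b_16": 0.85,
}

# Scores from a transferability metric computed without any fine-tuning
# (e.g. a LogME- or LEEP-style score over target features); values made up.
metric_score = {
    "resnet50": 0.42,
    "densenet121": 0.40,
    "mobilenet_v2": 0.31,
    "vit_b_16": 0.47,
}

# A dataset-agnostic heuristic that ignores the target task entirely:
# rank models by their published ImageNet top-1 accuracy (illustrative values).
imagenet_acc = {
    "resnet50": 0.761,
    "densenet121": 0.746,
    "mobilenet_v2": 0.718,
    "vit_b_16": 0.810,
}

models = list(finetuned_acc)
ground_truth = [finetuned_acc[m] for m in models]

def weighted_tau(scores):
    """Weighted Kendall's tau between a candidate ranking and fine-tuned accuracy."""
    tau, _ = stats.weightedtau([scores[m] for m in models], ground_truth)
    return tau

print(f"transferability metric:  tau = {weighted_tau(metric_score):.3f}")
print(f"task-agnostic heuristic: tau = {weighted_tau(imagenet_acc):.3f}")
```

With the toy numbers above both rankings agree perfectly, which is exactly the degenerate situation the authors argue a fixed model space with a static performance hierarchy tends to create: a heuristic that never looks at the target task scores as well as the dedicated metric.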