🤖 AI Summary
This study examines how computer science researchers use and trust large language model (LLM) leaderboards in practice. Drawing on semi-structured interviews with eight researchers across four subfields, analyzed with reflexive thematic analysis, it identifies a paradox of “pragmatic skepticism”: despite widespread doubts about leaderboard validity, researchers routinely rely on leaderboards as rough decision aids. The findings indicate that disciplinary culture, rather than individual attitudes, primarily shapes how leaderboards are used, that peer networks are the primary mechanism for model selection, and that arena-based (human-voting) leaderboards are consistently preferred over static benchmark leaderboards. Seven of the eight participants named cost transparency as the most needed missing feature. In response, the paper proposes leaderboard design recommendations centered on task-specific score breakdowns, explicit cost disclosure, and transparency about voter demographics.
📝 Abstract
Large language model (LLM) leaderboards rank AI models using standardized benchmarks and have become highly visible across computer science, despite known limitations in their reliability and robustness. Yet how they shape researchers' actual practice remains empirically uncharted. We address this gap through semi-structured interviews with eight researchers across four computer science subfields, analyzed using reflexive thematic analysis. We find a near-universal paradox of pragmatic skepticism: while participants expressed deep distrust of leaderboard rankings, they continued to use them as rough decision-making aids. Peer networks, not leaderboards, emerged as the primary model selection mechanism, and arena-based (human-voting) leaderboards were consistently preferred over static benchmark leaderboards. Leaderboard influence varied sharply across subfields, revealing that disciplinary culture, not individual attitudes, mediates engagement; for instance, NLP researchers faced state-of-the-art comparison pressure while HCI and Systems/Privacy researchers reported none. Across these differences, however, participants converged on cost transparency as the most demanded missing feature (seven of eight). We translate these findings into concrete design recommendations that align evaluation infrastructure with how researchers actually use it, such as task-specific score breakdowns, cost integration, and voter-demographic disclosure.