Who Defines "Best"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards

📅 2026-04-23
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Current LLM leaderboards reflect evaluation priorities set by benchmark designers rather than the diverse needs of users, and their single aggregate scores obscure how performance varies across prompt types. This work analyzes the dataset behind the LMArena (formerly Chatbot Arena) benchmark and finds that it is heavily skewed toward certain topics, that model rankings vary across prompt slices, and that preference-based judgments are used in ways that blur their intended scope. In response, the authors present an interactive visualization interface, built as a design probe, in which users select and weight prompt slices and explore how model rankings change. A qualitative study suggests that this approach improves transparency and supports more context-specific model evaluation.

๐Ÿ“ Abstract
LLM leaderboards are widely used to compare models and guide deployment decisions. However, leaderboard rankings are shaped by evaluation priorities set by benchmark designers, rather than by the diverse goals and constraints of actual users and organizations. A single aggregate score often obscures how models behave across different prompt types and compositions. In this work, we conduct an in-depth analysis of the dataset used in the LMArena (formerly Chatbot Arena) benchmark and investigate this evaluation challenge by designing an interactive visualization interface as a design probe. Our analysis reveals that the dataset is heavily skewed toward certain topics, that model rankings vary across prompt slices, and that preference-based judgments are used in ways that blur their intended scope. Building on this analysis, we introduce a visualization interface that allows users to define their own evaluation priorities by selecting and weighting prompt slices and to explore how rankings change accordingly. A qualitative study suggests that this interactive approach improves transparency and supports more context-specific model evaluation, pointing toward alternative ways to design and use LLM leaderboards.
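The idea of selecting and weighting prompt slices can be illustrated with a minimal sketch. It is not the authors' interface or LMArena's scoring method: it swaps in a plain per-slice win rate (ties count as half a win) in place of any rating model, and the slice names, model names, and battle records below are hypothetical toy data. It only shows how a user-chosen weighting can reorder a ranking.

```python
from collections import defaultdict


def slice_win_rates(battles):
    """Compute the win rate of each model within each prompt slice.

    battles: iterable of (slice_name, model_a, model_b, winner),
    where winner is "a", "b", or "tie". A tie counts as half a win.
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for slice_name, model_a, model_b, winner in battles:
        games[(slice_name, model_a)] += 1
        games[(slice_name, model_b)] += 1
        if winner == "a":
            wins[(slice_name, model_a)] += 1.0
        elif winner == "b":
            wins[(slice_name, model_b)] += 1.0
        else:
            wins[(slice_name, model_a)] += 0.5
            wins[(slice_name, model_b)] += 0.5
    return {key: wins[key] / games[key] for key in games}


def weighted_ranking(rates, weights):
    """Rank models by a weighted average of their per-slice win rates.

    weights: dict mapping slice_name to a non-negative user-chosen weight.
    Slices missing from weights contribute nothing; a model with no
    battles in a slice gets no credit for that slice.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one slice weight must be positive")
    scores = defaultdict(float)
    for (slice_name, model), rate in rates.items():
        scores[model] += weights.get(slice_name, 0.0) / total * rate
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: model_x is stronger on coding, model_y on writing.
battles = [
    ("coding", "model_x", "model_y", "a"),
    ("coding", "model_x", "model_y", "a"),
    ("writing", "model_x", "model_y", "b"),
    ("writing", "model_x", "model_y", "b"),
    ("writing", "model_x", "model_y", "tie"),
]

rates = slice_win_rates(battles)
coding_first = weighted_ranking(rates, {"coding": 0.8, "writing": 0.2})
writing_first = weighted_ranking(rates, {"coding": 0.2, "writing": 0.8})
print([model for model, _ in coding_first])
print([model for model, _ in writing_first])
```

With these toy records, a coding-heavy weighting places model_x first and a writing-heavy weighting places model_y first, which is the kind of ranking shift across prompt slices that a single aggregate score hides.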
Problem

Research questions and friction points this paper is trying to address.

LLM leaderboards
user-defined evaluation
prompt slices
model ranking
evaluation transparency
Innovation

Methods, ideas, or system contributions that make the work stand out.

interactive evaluation
user-defined metrics
LLM leaderboard
prompt slicing
visualization interface