🤖 AI Summary
Current foundation model (FM) leaderboards lack standardized evaluation guidelines, resulting in insufficient transparency and difficulty in model selection. Method: Drawing on a multi-source collection of 1,045 FM leaderboards, semi-structured interviews, and card-sorting analysis, this study employs a mixed-methods approach—including data acquisition, expert interviews, consensus-based coding, and domain modeling—to systematically investigate leaderboard practices. Contribution/Results: We introduce the concept of *LBOps* (leaderboard operations) and present the first domain-specific workflow model for FM leaderboards. We identify and formally define eight types of "leaderboard smells" and categorize five canonical workflow patterns. The analysis reveals critical issues, including poor traceability, inefficient collaboration, and systematic evaluation bias. Our work establishes a conceptual framework and actionable improvement pathways for standardizing FM evaluation, thereby enhancing the transparency, reproducibility, and engineering utility of leaderboard-based assessment.
📝 Abstract
Foundation models (FMs), i.e., large-scale machine learning (ML) models such as large language models (LLMs), have demonstrated remarkable adaptability across various downstream software engineering (SE) tasks, such as code completion, code understanding, and software development. As a result, FM leaderboards have become essential tools for SE teams to compare and select the best third-party FMs for their specific products and purposes. However, the lack of standardized guidelines for FM evaluation and comparison threatens the transparency of FM leaderboards and limits stakeholders' ability to perform effective FM selection. As a first step towards addressing this challenge, our research focuses on understanding how these FM leaderboards operate in real-world scenarios ("leaderboard operations", or LBOps) and identifying potential pitfalls and areas for improvement ("leaderboard smells"). To this end, we collect 1,045 FM leaderboards from five different sources: GitHub, Hugging Face Spaces, Papers With Code, spreadsheets, and independent platforms. We examine their documentation and engage in direct communication with leaderboard operators to understand their workflows. Through card sorting and negotiated agreement, we identify five distinct workflow patterns and develop a domain model that captures the key components and their interactions within these workflows. We then identify eight unique types of leaderboard smells in LBOps. By mitigating these smells, SE teams can improve transparency, accountability, and collaboration in current LBOps practices, fostering a more robust and responsible ecosystem for FM comparison and selection.