A Position Paper on the Automatic Generation of Machine Learning Leaderboards

📅 2025-05-23
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
Problem: Manually constructing and maintaining machine learning leaderboards is costly and suffers from inconsistent evaluation standards, and existing automatic leaderboard generation (ALG) research lacks a unified problem formulation, hindering cross-study comparison and reproducibility.
Method: The paper proposes the first unified conceptual framework for ALG, defining its core task as the precise extraction of structured experimental entries from research papers (models, datasets, metrics, numerical results, and contextual metadata), coupled with cross-paper alignment for consistency. The approach combines literature analysis and task abstraction to design a paradigm covering all result types and fine-grained metadata, and introduces a standardized evaluation protocol assessing three dimensions: information extraction accuracy, structural completeness, and cross-paper comparability.
Contribution/Results: The work establishes community-agreed benchmarking guidelines, providing both theoretical foundations and practical pathways for automated scientific infrastructure in ML research.
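
To make the extraction target concrete, here is a minimal sketch of one structured experimental entry as a Python record; the field names and types are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class LeaderboardEntry:
    """One structured experimental result extracted from a paper.

    Field names are illustrative assumptions, not the paper's schema.
    """
    paper_id: str   # source paper, e.g. an arXiv identifier
    task: str       # e.g. "question answering"
    dataset: str    # e.g. "SQuAD 2.0"
    metric: str     # e.g. "F1"
    value: float    # the reported numerical result
    model: str      # e.g. "BERT-large"
    metadata: dict = field(default_factory=dict)  # contextual details: split, setting, etc.
```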

📝 Abstract
An important task in machine learning (ML) research is comparing prior work, which is often performed via ML leaderboards: a tabular overview of experiments with comparable conditions (e.g., same task, dataset, and metric). However, the growing volume of literature creates challenges in creating and maintaining these leaderboards. To ease this burden, researchers have developed methods to extract leaderboard entries from research papers for automated leaderboard curation. Yet, prior work varies in problem framing, complicating comparisons and limiting real-world applicability. In this position paper, we present the first overview of Automatic Leaderboard Generation (ALG) research, identifying fundamental differences in assumptions, scope, and output formats. We propose a unified conceptual framework to standardise how the ALG task is defined. We offer ALG benchmarking guidelines, including recommendations for datasets and metrics that promote fair, reproducible evaluation. Lastly, we outline challenges and new directions for ALG, such as advocating for broader coverage by including all reported results and richer metadata.
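
To make the "comparable conditions" idea concrete, the following minimal sketch (reusing the hypothetical LeaderboardEntry record above) shows how extracted entries could be aligned across papers into leaderboards keyed by task, dataset, and metric; the grouping and ranking logic is an illustrative assumption, not a method from the paper.

```python
from collections import defaultdict

def build_leaderboards(entries):
    """Align extracted entries across papers into leaderboards keyed by
    comparable conditions (task, dataset, metric).

    Assumes higher metric values are better; a real system would also
    normalise task/dataset names and handle metric direction.
    """
    boards = defaultdict(list)
    for e in entries:
        boards[(e.task, e.dataset, e.metric)].append(e)
    return {key: sorted(rows, key=lambda r: r.value, reverse=True)
            for key, rows in boards.items()}
```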
Problem

Research questions and friction points this paper is trying to address.

Standardizing Automatic Leaderboard Generation (ALG) task definitions and frameworks
Providing benchmarking guidelines for fair and reproducible ALG evaluation (a toy scoring sketch follows this list)
Addressing challenges in ALG coverage and metadata inclusion
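
As a toy illustration of the evaluation dimension above, extraction accuracy is commonly scored as precision/recall/F1 over extracted result tuples, e.g. (task, dataset, metric, value); the exact-match scoring below is an assumption for illustration, not the paper's protocol.

```python
def tuple_f1(predicted, gold):
    """Score extraction accuracy as exact-match precision/recall/F1
    over result tuples, e.g. (task, dataset, metric, value).

    Exact matching is an illustrative simplification; the paper's
    protocol also covers structural completeness and comparability.
    """
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```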
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes unified framework for leaderboard generation
Recommends benchmarking guidelines for evaluation
Advocates broader coverage with richer metadata
👥 Authors
Roelien C. Timmer (CSIRO Data61, Australia)
Stephen Wan (CSIRO Data61, Australia; computational linguistics)
Yufang Hou (IT:U Interdisciplinary Transformation University Austria)