Unreflected Use of Tabular Data Repositories Can Undermine Research Quality

📅 2025-03-12
🤖 AI Summary
This paper argues that the unreflected use of tabular datasets from data repositories such as OpenML may have reduced research quality and scientific rigor. Using examples from prominent recent studies, it illustrates three pitfalls: suboptimal model selection strategies, overlooked strong baselines, and inappropriate preprocessing. It then discusses how data repositories could prevent the inappropriate use of datasets and become cornerstones for higher-quality empirical research.

📝 Abstract
Data repositories have accumulated a large number of tabular datasets from various domains. Machine Learning researchers are actively using these datasets to evaluate novel approaches. Consequently, data repositories have an important standing in tabular data research. They not only host datasets but also provide information on how to use them in supervised learning tasks. In this paper, we argue that, despite great achievements in usability, the unreflected usage of datasets from data repositories may have led to reduced research quality and scientific rigor. We present examples from prominent recent studies that illustrate the problematic use of datasets from OpenML, a large data repository for tabular data. Our illustrations help users of data repositories avoid falling into the traps of (1) using suboptimal model selection strategies, (2) overlooking strong baselines, and (3) inappropriate preprocessing. In response, we discuss possible solutions for how data repositories can prevent the inappropriate use of datasets and become the cornerstones for improved overall quality of empirical research studies.
Problem

Research questions and friction points this paper is trying to address.

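Unreflected use of tabular datasets from repositories such as OpenML may reduce research quality and scientific rigor.
Suboptimal model selection strategies undermine the rigor of empirical evaluations.
Inappropriate preprocessing and overlooked strong baselines further weaken the quality of empirical studies.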
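
The abstract does not spell out which preprocessing mistakes it has in mind. As a minimal sketch of the preprocessing concern listed above, the following assumes one common example: estimating scaling statistics on the full dataset instead of on the training split only. The helper names (`fit_standardizer`, `standardize`) and the toy values are hypothetical and are not taken from the paper or from OpenML.

```python
from statistics import mean, pstdev


def fit_standardizer(column):
    """Estimate mean and standard deviation from the given values."""
    mu = mean(column)
    sigma = pstdev(column) or 1.0  # fall back to 1.0 for constant columns
    return mu, sigma


def standardize(column, mu, sigma):
    """Apply previously estimated statistics to a column."""
    return [(x - mu) / sigma for x in column]


# Hypothetical toy feature column, already split into train and test.
train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

# Assumed-correct variant: statistics come from the training split only
# and are then reused unchanged on the test split.
mu, sigma = fit_standardizer(train)
train_scaled = standardize(train, mu, sigma)
test_scaled = standardize(test, mu, sigma)

# Leaky variant: statistics are estimated on train and test together,
# so information about the test split influences the preprocessing.
leaky_mu, leaky_sigma = fit_standardizer(train + test)
```

In this sketch the leaky statistics differ from the train-only ones because the test values shift the mean and spread, which is one way preprocessing choices could distort a reported evaluation.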
Innovation

Methods, ideas, or system contributions that make the work stand out.

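Illustrates suboptimal model selection strategies with examples from prominent recent studies that use OpenML datasets.
Shows how strong baselines are overlooked in evaluations on repository datasets.
Discusses how data repositories can prevent inappropriate preprocessing and other inappropriate uses of datasets.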