TabArena: A Living Benchmark for Machine Learning on Tabular Data

📅 2025-06-20
🤖 AI Summary
Problem: Existing tabular data benchmarks are predominantly static, failing to adapt to evolving data, model advancements, and shifting evaluation requirements.

Method: We introduce the first continuously maintained "living benchmark" paradigm for tabular learning, enabling dynamic iteration on datasets, model repositories, and evaluation protocols. Built on human-curated datasets and well-implemented open-source models, this large-scale, reproducible experimental platform systematically investigates the impact of validation strategies, hyperparameter optimization, and cross-model ensembling.

Contributions/Results: (1) gradient-boosted trees remain strong baselines in practical settings; (2) deep learning methods match or exceed them given larger compute budgets with ensembling; (3) foundation models excel on smaller datasets; and (4) cross-model ensembling advances the state of the art in tabular machine learning. The benchmark further establishes standardized maintenance and release protocols to promote sustainability and scalability in tabular benchmarking.

📝 Abstract
With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static. Their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of validation method and ensembling of hyperparameter configurations to benchmark models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state-of-the-art in tabular machine learning and investigate the contributions of individual models. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.
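The abstract highlights "ensembling of hyperparameter configurations" and cross-model ensembles as key to benchmarking models at full potential. A common way AutoML systems build such post-hoc ensembles is greedy weighted selection over validation predictions (Caruana et al., 2004). The sketch below is illustrative only: the function name, data layout, and RMSE objective are assumptions for this example, not TabArena's actual API.

```python
# Illustrative sketch of post-hoc cross-model ensembling via greedy
# weighted selection. Names here (greedy_ensemble, val_preds) are
# hypothetical, not part of TabArena's codebase.
import numpy as np

def greedy_ensemble(val_preds, y_val, n_rounds=10):
    """Greedily add models (with replacement) to minimize validation RMSE.

    val_preds: dict mapping model name -> 1-D array of validation predictions
    y_val: 1-D array of true validation targets
    Returns a dict of per-model weights summing to 1.
    """
    names = list(val_preds)
    chosen = []
    running_sum = np.zeros_like(y_val, dtype=float)
    for r in range(1, n_rounds + 1):
        # Try appending each candidate model; keep whichever yields the
        # lowest RMSE for the averaged ensemble of size r.
        scores = {
            m: np.sqrt(np.mean(((running_sum + val_preds[m]) / r - y_val) ** 2))
            for m in names
        }
        best = min(scores, key=scores.get)
        chosen.append(best)
        running_sum += val_preds[best]
    return {m: chosen.count(m) / n_rounds for m in names}
```

Because models are selected with replacement, the resulting weights are fractions of selection counts; a model that never helps on validation data simply receives weight zero.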
Problem

Research questions and friction points this paper is trying to address.

Addressing the lack of dynamic benchmarks for tabular data machine learning
Evaluating performance of deep learning vs. gradient-boosted trees on tabular data
Establishing a living benchmark with public leaderboard and maintenance protocols
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuously maintained living benchmarking system
Manual curation of datasets and models
Public leaderboard with reproducible code