Towards Adaptive ML Benchmarks: Web-Agent-Driven Construction, Domain Expansion, and Metric Optimization

📅 2025-09-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing ML benchmarks exhibit significant limitations in task coverage, domain diversity, difficulty modeling, and evaluation rigor, hindering comprehensive assessment of LLM agents’ capabilities across end-to-end machine learning workflows. To address this, we propose TAM Bench—the first end-to-end AutoML benchmark explicitly designed for LLM agents, supporting multimodal and heterogeneous tasks. Our method introduces a web-agent-driven automated task acquisition pipeline, a leaderboard-based quantitative difficulty scoring scheme, and a multidimensional evaluation framework integrating performance, compliance (e.g., code correctness and constraint adherence), and generalization. Built upon 150 real-world AutoML tasks, TAM Bench defines three progressively scaled subsets—Lite (18 tasks), Medium, and Full—ensuring balanced modality distribution and calibrated difficulty gradients. This design enables scalable, fine-grained, and practically grounded agent evaluation.
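The summary names participant counts and score dispersion as the inputs to the difficulty score but does not give the formula, so the sketch below only illustrates one plausible way such a score could be computed; the normalization, scale factor, and weights are assumptions, not the authors' scheme.

```python
import statistics

def difficulty_score(leaderboard_scores, n_participants,
                     w_participation=0.5, w_dispersion=0.5):
    """Illustrative per-task difficulty estimate (not the paper's formula).

    Two leaderboard signals are combined:
    - participation: a large field of entrants is read as a more accessible
      task, so the term shrinks as n_participants grows (hypothetical scaling);
    - score dispersion: tightly clustered public scores suggest a near-saturated
      task, widely spread scores suggest it is harder to solve well.
    """
    # Participation term in (0, 1]; the 1000 scale factor is arbitrary.
    participation_term = 1.0 / (1.0 + n_participants / 1000.0)

    # Coefficient of variation of leaderboard scores as a dispersion proxy.
    mean = statistics.fmean(leaderboard_scores)
    dispersion_term = statistics.pstdev(leaderboard_scores) / abs(mean) if mean else 0.0

    return w_participation * participation_term + w_dispersion * dispersion_term

# A small, widely spread leaderboard scores as harder than a huge, tight one.
hard = difficulty_score([0.31, 0.52, 0.70, 0.88], n_participants=120)
easy = difficulty_score([0.97, 0.975, 0.98, 0.981], n_participants=4500)
assert hard > easy
```

A score of this kind would then support the calibrated difficulty gradients used to build the Lite, Medium, and Full subsets.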

📝 Abstract
Recent advances in large language models (LLMs) have enabled the emergence of general-purpose agents for automating end-to-end machine learning (ML) workflows, including data analysis, feature engineering, model training, and competition solving. However, existing benchmarks remain limited in task coverage, domain diversity, difficulty modeling, and evaluation rigor, failing to capture the full capabilities of such agents in realistic settings. We present TAM Bench, a diverse, realistic, and structured benchmark for evaluating LLM-based agents on end-to-end ML tasks. TAM Bench features three key innovations: (1) A browser automation and LLM-based task acquisition system that automatically collects and structures ML challenges from platforms such as Kaggle, AIcrowd, and Biendata, spanning multiple task types and data modalities (e.g., tabular, text, image, graph, audio); (2) A leaderboard-driven difficulty modeling mechanism that estimates task complexity using participant counts and score dispersion, enabling scalable and objective task calibration; (3) A multi-dimensional evaluation framework incorporating performance, format compliance, constraint adherence, and task generalization. Based on 150 curated AutoML tasks, we construct three benchmark subsets of different sizes -- Lite, Medium, and Full -- designed for varying evaluation scenarios. The Lite version, with 18 tasks and balanced coverage across modalities and difficulty levels, serves as a practical testbed for daily benchmarking and comparative studies.
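The abstract lists four evaluation dimensions (performance, format compliance, constraint adherence, task generalization) without specifying how they are aggregated, so the following sketch only shows one way such results could be recorded and folded into a single comparable number; the field names and weights are placeholders, not TAM Bench's actual scoring rules.

```python
from dataclasses import dataclass

@dataclass
class AgentRunResult:
    """Per-task outcome of one agent run (field names are illustrative)."""
    normalized_performance: float  # task metric mapped to [0, 1], higher is better
    format_compliant: bool         # submission parses and matches the expected schema
    constraints_respected: bool    # e.g. runtime, resource, or rule constraints met
    generalization_score: float    # score on a held-out or shifted task variant, in [0, 1]

def aggregate(result, weights=(0.5, 0.2, 0.1, 0.2)):
    """Fold the four dimensions into one score; the weighting is a placeholder."""
    w_perf, w_fmt, w_constraint, w_gen = weights
    return (w_perf * result.normalized_performance
            + w_fmt * float(result.format_compliant)
            + w_constraint * float(result.constraints_respected)
            + w_gen * result.generalization_score)

run = AgentRunResult(0.82, True, False, 0.74)
print(f"composite score: {aggregate(run):.3f}")  # 0.758
```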
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM-based agents on end-to-end ML tasks
Addressing limited task coverage and domain diversity in benchmarks
Providing scalable difficulty modeling and multi-dimensional evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Browser automation for ML task collection (a minimal sketch follows this list)
Leaderboard-driven difficulty modeling mechanism
Multi-dimensional evaluation framework incorporating generalization
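To make the first innovation concrete, here is a minimal sketch of a headless-browser collection step, assuming a library such as Playwright; the competition URL, the structure_with_llm stub, and the extracted fields are hypothetical placeholders rather than the paper's actual acquisition pipeline.

```python
# Requires: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

# Placeholder URL; the benchmark draws tasks from Kaggle, AIcrowd, and Biendata.
COMPETITION_URLS = ["https://www.kaggle.com/competitions/titanic"]

def fetch_raw_pages(urls):
    """Download the rendered HTML of each competition page with a headless browser."""
    pages = {}
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        for url in urls:
            page.goto(url, wait_until="networkidle")
            pages[url] = page.content()
        browser.close()
    return pages

def structure_with_llm(html):
    """Stub for the LLM step that would extract a structured task record
    (title, task type, modality, metric, constraints) from the raw page."""
    return {"title": None, "modality": None, "metric": None, "raw_html_chars": len(html)}

if __name__ == "__main__":
    for url, html in fetch_raw_pages(COMPETITION_URLS).items():
        print(url, structure_with_llm(html))
```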
Hangyi Jia
Fudan University
Yuxi Qian
Ant Group
Hanwen Tong
Ant Group
Xinhui Wu
Ant Group
Lin Chen
Ant Group
Feng Wei