TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing time-series AI agent benchmarks suffer from poor scalability, narrow task scope, and limited evaluation modalities (relying solely on CSV submissions), and thus fail to reflect real-world ML engineering practice. Method: We propose the first scalable, time-series-focused agent evaluation framework for ML engineering. It supports multi-dimensional tasks, including data preprocessing, code understanding, repository analysis, and model submission, and natively evaluates multimodal outputs (code, models, files). The framework features a dual-axis scalability design: (1) a cross-domain task composition mechanism, and (2) a hybrid evaluation paradigm integrating deterministic metrics with LLM-based semantic assessment. Built modularly in Python, it incorporates automated sandboxing and a multi-granularity evaluation engine that generalizes beyond time-series domains. Contribution/Results: Validated on 12 diverse time-series challenges, the framework enables systematic, quantitative assessment of AI agents' engineering capabilities. All components are fully open-sourced.
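The hybrid evaluation paradigm can be illustrated with a minimal sketch: a deterministic metric scores the numeric artifact (e.g., a forecast submission), while an injectable grader callable stands in for the LLM-based semantic assessment of code or model artifacts. All names here (`hybrid_evaluate`, `EvalResult`, the stub grader) are illustrative assumptions, not the actual TimeSeriesGym API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    numeric_score: float   # deterministic metric (e.g., MAE on a submission file)
    semantic_score: float  # LLM-graded score in [0, 1] for code/model artifacts

def mae(preds: list[float], targets: list[float]) -> float:
    """Deterministic metric: mean absolute error on a forecast submission."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

def hybrid_evaluate(
    preds: list[float],
    targets: list[float],
    code: str,
    llm_grader: Callable[[str], float],
) -> EvalResult:
    """Combine a precise numeric measure with a flexible, context-aware judgment."""
    return EvalResult(numeric_score=mae(preds, targets),
                      semantic_score=llm_grader(code))

# Stub standing in for an actual LLM call, so the sketch runs offline.
stub_grader = lambda code: 1.0 if "def " in code else 0.0
result = hybrid_evaluate([1.0, 2.0], [1.5, 2.5], "def forecast(): ...", stub_grader)
```

Making the grader an injected callable keeps the deterministic and LLM-based halves independent, which mirrors the paper's goal of balancing objective assessment with contextual judgment.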

📝 Abstract
We introduce TimeSeriesGym, a scalable benchmarking framework for evaluating Artificial Intelligence (AI) agents on time series machine learning engineering challenges. Existing benchmarks lack scalability, focus narrowly on model building in well-defined settings, and evaluate only a limited set of research artifacts (e.g., CSV submission files). To make AI agent benchmarking more relevant to the practice of machine learning engineering, our framework scales along two critical dimensions. First, recognizing that effective ML engineering requires a range of diverse skills, TimeSeriesGym incorporates challenges from diverse sources spanning multiple domains and tasks. We design challenges to evaluate both isolated capabilities (including data handling, understanding research repositories, and code translation) and their combinations, and rather than addressing each challenge independently, we develop tools that support designing multiple challenges at scale. Second, we implement evaluation mechanisms for multiple research artifacts, including submission files, code, and models, using both precise numeric measures and more flexible LLM-based evaluation approaches. This dual strategy balances objective assessment with contextual judgment. Although our initial focus is on time series applications, our framework can be readily extended to other data modalities, broadly enhancing the comprehensiveness and practical utility of agentic AI evaluation. We open-source our benchmarking framework to facilitate future research on the ML engineering capabilities of AI agents.
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI agents on diverse time series ML engineering challenges
Addressing scalability gaps in current benchmarking frameworks
Assessing multiple research artifacts with hybrid evaluation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable benchmarking framework for AI agents
Diverse challenges across multiple domains and tasks
Dual evaluation with numeric and LLM-based methods
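The "challenges at scale" idea, deriving new challenges from existing ones rather than authoring each independently, can be sketched as a small registry with a composition helper. The names (`Challenge`, `register`, `compose`) and the example challenge are hypothetical, assumed for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Challenge:
    name: str
    domain: str          # e.g., "forecasting", "classification"
    skills: list[str]    # isolated capabilities the task probes
    artifacts: list[str] # artifacts to evaluate: "csv", "code", "model"

REGISTRY: dict[str, Challenge] = {}

def register(ch: Challenge) -> None:
    REGISTRY[ch.name] = ch

def compose(base: Challenge, extra_skill: str, suffix: str) -> Challenge:
    """Derive a new challenge by layering an additional skill onto an
    existing one, instead of designing every challenge from scratch."""
    return Challenge(name=f"{base.name}-{suffix}", domain=base.domain,
                     skills=base.skills + [extra_skill],
                     artifacts=base.artifacts)

register(Challenge("ecg-forecast", "forecasting", ["data handling"], ["csv"]))
register(compose(REGISTRY["ecg-forecast"], "code translation", "xlate"))
```

Composition along the skill axis is one plausible reading of the paper's cross-domain task composition mechanism: each derived challenge tests a combination of capabilities while reusing the base task's data and evaluation setup.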