STEB: In Search of the Best Evaluation Approach for Synthetic Time Series

📅 2025-05-27
🤖 AI Summary
To address the lack of objective, large-scale benchmarks for evaluating synthetic time series, this paper introduces STEB—the first standardized evaluation benchmark. STEB comprises ten real and synthetic datasets, integrates thirteen configurable transformations and stochasticity-injection mechanisms, and supports parallel/sequential evaluation alongside runtime and error tracking. We propose a dual-dimension evaluation framework—“reliability” and “consistency”—and systematically benchmark 41 mainstream evaluation metrics for the first time. Empirical analysis reveals that time-series embeddings exert a decisive influence on metric outcomes: different embeddings substantially alter metric rankings and score stability. The open-source STEB framework includes a fully reproducible evaluation pipeline, establishing a scalable, interpretable, and automated paradigm for synthetic time-series assessment.

📝 Abstract
The growing need for synthetic time series, driven by data augmentation and privacy regulations, has led to numerous generative models, frameworks, and evaluation measures alike. Objectively comparing these measures on a large scale remains an open challenge. We propose the Synthetic Time series Evaluation Benchmark (STEB) -- the first benchmark framework that enables comprehensive and interpretable automated comparisons of synthetic time series evaluation measures. Using 10 diverse datasets, randomness injection, and 13 configurable data transformations, STEB computes indicators for measure reliability and score consistency. It tracks running time and test errors, and features sequential and parallel modes of operation. In our experiments, we determine a ranking of 41 measures from the literature and confirm that the choice of upstream time series embedding heavily impacts the final score.
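The abstract's core idea of scoring an evaluation measure by "reliability" (does its score respond sensibly to controlled degradation?) and "consistency" (is its score stable across random seeds?) can be illustrated with a toy sketch. This is not STEB's actual pipeline or API; the measure (`toy_measure`), noise levels, and indicator definitions below are hypothetical stand-ins chosen for illustration.

```python
import numpy as np

def toy_measure(real, synth):
    # Toy distance between two 1-D samples of equal size:
    # the empirical 1-Wasserstein distance (mean gap of sorted samples).
    return float(np.mean(np.abs(np.sort(real) - np.sort(synth))))

def benchmark_measure(measure, n=1000, noise_levels=(0.0, 0.5, 1.0, 2.0),
                      seeds=(0, 1, 2)):
    """Score an evaluation measure on two illustrative indicators.

    Degrades "synthetic" data with increasing noise and checks whether
    the measure's score tracks the degradation (reliability) and how
    much it varies across seeds (consistency).
    """
    per_seed_scores = []
    for seed in seeds:
        rng = np.random.default_rng(seed)
        real = rng.standard_normal(n)
        # Synthetic data = real data plus noise of growing magnitude.
        scores = [measure(real, real + rng.standard_normal(n) * s)
                  for s in noise_levels]
        per_seed_scores.append(scores)
    arr = np.array(per_seed_scores)
    mean_scores = arr.mean(axis=0)
    # Reliability indicator: score grows monotonically with degradation.
    reliability = bool(np.all(np.diff(mean_scores) > 0))
    # Consistency indicator: average spread of scores across seeds.
    consistency = float(arr.std(axis=0).mean())
    return reliability, consistency

reliable, spread = benchmark_measure(toy_measure)
```

A real benchmark in this spirit would repeat such probes across many datasets, transformations, and embeddings, then aggregate the indicators into a ranking of measures.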
Problem

Research questions and friction points this paper is trying to address.

No objective, large-scale way to compare synthetic time series evaluation measures
Lack of a comprehensive, automated benchmark for evaluation measures
Assessing the reliability and consistency of 41 measures from the literature
Innovation

Methods, ideas, or system contributions that make the work stand out.

STEB benchmark for synthetic time series evaluation
Uses 10 datasets and 13 configurable transformations
Ranks 41 measures with reliability indicators