🤖 AI Summary
This work addresses the lack of interpretability and conservatism in training data selection for machine learning in safety-critical domains such as power systems. We propose an unsupervised data filtering method based on the sliced Wasserstein distance (SWD), giving training set curation an optimal transport interpretation and a conservative, interpretable screening mechanism. To ensure scalability, the method is paired with two efficient approximations: concurrent processing of reduced-cardinality dataset representations and a computationally light Euclidean distance approximation; filtering behavior is validated on synthetic data. Key contributions include: (1) releasing the first publicly available dataset showcasing localized critical peak rebate demand response in a northern climate; and (2) conducting the first load forecasting benchmark study on this dataset, demonstrating that filtered training data improve downstream task performance, e.g., reducing prediction error by up to 18.3% across multiple models. The framework combines theoretical grounding with practical efficacy, advancing trustworthy data curation for critical infrastructure analytics.
📝 Abstract
We propose a new unsupervised anomaly detection method based on the sliced-Wasserstein distance for training data selection in machine learning approaches. Our filtering technique is of interest to decision-making pipelines deploying machine learning models in critical sectors, e.g., power systems, as it offers a conservative data selection and an optimal transport interpretation. To ensure the scalability of our method, we provide two efficient approximations. The first processes reduced-cardinality representations of the datasets concurrently. The second makes use of a computationally light Euclidean distance approximation. Additionally, we release the first open dataset showcasing localized critical peak rebate demand response in a northern climate. We present the filtering patterns of our method on synthetic datasets and numerically benchmark our method for training data selection. Finally, we employ our method as part of a first forecasting benchmark for our open-source dataset.
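To make the core idea concrete, the following is a minimal sketch of sliced-Wasserstein-based screening, not the authors' implementation: the SWD between two empirical distributions is estimated by projecting both sample sets onto random directions and averaging 1D Wasserstein distances between the projections, and candidate datasets whose SWD to a reference exceeds a threshold are filtered out. The function names, the quantile-based 1D estimator, and the `threshold` parameter are illustrative assumptions.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_projections=100, n_quantiles=50, seed=0):
    """Monte Carlo estimate of the sliced 2-Wasserstein distance between
    empirical distributions given by samples X (n, d) and Y (m, d)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Random directions, normalized to lie on the unit sphere.
    theta = rng.normal(size=(n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    qs = np.linspace(0.0, 1.0, n_quantiles)
    sw2 = 0.0
    for t in theta:
        # Project both sample sets onto the direction t.
        px, py = X @ t, Y @ t
        # 1D Wasserstein-2 via matched quantiles of the projections.
        qx, qy = np.quantile(px, qs), np.quantile(py, qs)
        sw2 += np.mean((qx - qy) ** 2)
    return np.sqrt(sw2 / n_projections)

def filter_training_sets(reference, candidates, threshold):
    """Conservative screening (illustrative): keep only candidate datasets
    whose SWD to the reference dataset stays below the threshold."""
    return [C for C in candidates if sliced_wasserstein(reference, C) <= threshold]
```

In this sketch, a candidate dataset drawn from a distribution far from the reference (e.g., shifted in mean) produces a large SWD and is screened out, while a candidate matching the reference passes; the paper's two approximations would replace the full computation with reduced-cardinality representations or a lighter Euclidean surrogate.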