Sliced-Wasserstein Distance-based Data Selection

📅 2025-04-17
🤖 AI Summary
This work addresses the need for interpretable and conservative training data selection in machine learning for safety-critical domains such as power systems. It proposes an unsupervised anomaly detection and data filtering method based on the sliced-Wasserstein distance (SWD), which gives the filter an optimal transport interpretation. Two approximations keep the method scalable: one processes reduced-cardinality representations of the datasets concurrently, and the other uses a computationally light Euclidean distance approximation. The filtering patterns are illustrated on synthetic datasets. Key contributions include: (1) releasing the first open dataset showcasing localized critical peak rebate demand response in a northern climate; and (2) conducting a first forecasting benchmark on this dataset, in which filtered training data reportedly reduce prediction error by up to 18.3% across multiple models.

📝 Abstract
We propose a new unsupervised anomaly detection method based on the sliced-Wasserstein distance for training data selection in machine learning approaches. Our filtering technique is of interest for decision-making pipelines that deploy machine learning models in critical sectors, e.g., power systems, as it offers conservative data selection and an optimal transport interpretation. To ensure the scalability of our method, we provide two efficient approximations. The first approximation processes reduced-cardinality representations of the datasets concurrently. The second makes use of a computationally light Euclidean distance approximation. Additionally, we release the first open dataset showcasing localized critical peak rebate demand response in a northern climate. We present the filtering patterns of our method on synthetic datasets and numerically benchmark our method for training data selection. Finally, we employ our method as part of a first forecasting benchmark for our open-source dataset.
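
For reference, the sliced-Wasserstein distance named in the abstract is usually defined as follows. This is the standard textbook definition, not notation taken from the paper, so the symbols ($\mu$, $\nu$, $\theta$, $\sigma$, $p$) are assumptions and the paper's own formulation may differ.

```latex
% Sliced p-Wasserstein distance between probability measures \mu, \nu on R^d.
% \theta_{\#}\mu is the pushforward of \mu under the projection x -> <\theta, x>,
% and \sigma is the uniform probability measure on the unit sphere S^{d-1}.
\[
  \mathrm{SW}_p^p(\mu, \nu)
  = \int_{\mathbb{S}^{d-1}} W_p^p\!\left(\theta_{\#}\mu,\ \theta_{\#}\nu\right) \mathrm{d}\sigma(\theta)
\]
% Each one-dimensional term has a closed form in terms of quantile functions:
\[
  W_p^p\!\left(\theta_{\#}\mu,\ \theta_{\#}\nu\right)
  = \int_0^1 \left| F_{\theta_{\#}\mu}^{-1}(t) - F_{\theta_{\#}\nu}^{-1}(t) \right|^p \mathrm{d}t
\]
```

The closed form in one dimension is what makes the distance cheap to evaluate: each projection only requires sorting the projected samples.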

Problem

Research questions and friction points this paper is trying to address.

Unsupervised anomaly detection for training data selection
Scalable approximations for large dataset processing
Conservative data filtering for critical decision-making sectors

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses sliced-Wasserstein distance for data selection
Provides two efficient scalable approximations
Introduces first northern climate demand dataset
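
To make the first item above concrete, here is a minimal sketch of a Monte Carlo estimate of the sliced-Wasserstein distance between two point clouds. It is a generic illustration and not the authors' filtering algorithm or either of their approximations; the function name `sliced_wasserstein`, its parameters, and the equal-sample-size restriction are all assumptions made for brevity.

```python
import numpy as np


def sliced_wasserstein(x, y, n_projections=200, p=2, seed=0):
    """Monte Carlo estimate of the sliced p-Wasserstein distance.

    Hypothetical helper for illustration: x and y are point clouds of shape
    (n, d) with the same number of samples n, each sample weighted equally.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 2 or x.shape != y.shape:
        raise ValueError("x and y must both have shape (n, d)")

    rng = np.random.default_rng(seed)

    # Draw random directions uniformly on the unit sphere by normalizing
    # standard Gaussian vectors.
    theta = rng.normal(size=(n_projections, x.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)

    # Project both clouds onto every direction: result has shape (n, n_projections).
    # Sorting each column gives the empirical quantiles of the 1D projections.
    x_proj = np.sort(x @ theta.T, axis=0)
    y_proj = np.sort(y @ theta.T, axis=0)

    # For equal-size empirical measures in 1D, W_p^p is the mean of
    # |sorted difference|^p. Averaging over projections estimates SW_p^p.
    sw_pp = np.mean(np.abs(x_proj - y_proj) ** p)
    return float(sw_pp ** (1.0 / p))


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    reference = rng.normal(size=(500, 4))
    similar = rng.normal(size=(500, 4))
    shifted = rng.normal(loc=2.0, size=(500, 4))
    print("similar:", sliced_wasserstein(reference, similar))
    print("shifted:", sliced_wasserstein(reference, shifted))
```

A filter built on such a score would compare candidate data against a reference set and discard candidates whose distance is large. How the threshold is chosen, and how the paper's two approximations reduce the cost, is not specified in this summary.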