🤖 AI Summary
This work addresses the lack of interpretability and conservatism in training data selection for machine learning in safety-critical domains such as power systems. We propose an unsupervised data filtering method based on the sliced Wasserstein distance (SWD), giving training set curation an optimal transport interpretation and a conservative, interpretable screening mechanism. To ensure scalability, the method is paired with two efficient approximations: concurrent processing of reduced-cardinality dataset representations and a computationally light Euclidean distance approximation; filtering behavior is validated on synthetic data. Key contributions include: (1) releasing the first publicly available dataset showcasing localized critical peak rebate demand response in a northern climate; and (2) conducting the first load forecasting benchmark study on this dataset, demonstrating that filtered training data improve downstream task performance, e.g., reducing prediction error by up to 18.3% across multiple models. The framework combines theoretical grounding with practical efficacy, advancing trustworthy data curation for critical infrastructure analytics.
📝 Abstract
We propose a new unsupervised anomaly detection method based on the sliced-Wasserstein distance for training data selection in machine learning approaches. Our filtering technique is of interest to decision-making pipelines deploying machine learning models in critical sectors, e.g., power systems, as it offers a conservative data selection and an optimal transport interpretation. To ensure the scalability of our method, we provide two efficient approximations. The first processes reduced-cardinality representations of the datasets concurrently. The second makes use of a computationally light Euclidean distance approximation. Additionally, we release the first open dataset showcasing localized critical peak rebate demand response in a northern climate. We present the filtering patterns of our method on synthetic datasets and numerically benchmark our method for training data selection. Finally, we employ our method as part of a first forecasting benchmark for our open-source dataset.
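To make the core idea concrete, the following is a minimal sketch of sliced-Wasserstein-based screening, not the authors' implementation: the SWD between two empirical distributions is estimated by projecting both sample sets onto random directions and averaging 1D Wasserstein distances between the projections, and candidate datasets whose SWD to a reference exceeds a threshold are filtered out. The function names, the quantile-based 1D estimator, and the `threshold` parameter are illustrative assumptions.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_projections=100, n_quantiles=50, seed=0):
    """Monte Carlo estimate of the sliced 2-Wasserstein distance between
    empirical distributions given by samples X (n, d) and Y (m, d)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Random directions, normalized to lie on the unit sphere.
    theta = rng.normal(size=(n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    qs = np.linspace(0.0, 1.0, n_quantiles)
    sw2 = 0.0
    for t in theta:
        # Project both sample sets onto the direction t.
        px, py = X @ t, Y @ t
        # 1D Wasserstein-2 via matched quantiles of the projections.
        qx, qy = np.quantile(px, qs), np.quantile(py, qs)
        sw2 += np.mean((qx - qy) ** 2)
    return np.sqrt(sw2 / n_projections)

def filter_training_sets(reference, candidates, threshold):
    """Conservative screening (illustrative): keep only candidate datasets
    whose SWD to the reference dataset stays below the threshold."""
    return [C for C in candidates if sliced_wasserstein(reference, C) <= threshold]
```

In this sketch, a candidate dataset drawn from a distribution far from the reference (e.g., shifted in mean) produces a large SWD and is screened out, while a candidate matching the reference passes; the paper's two approximations would replace the full computation with reduced-cardinality representations or a lighter Euclidean surrogate.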