🤖 AI Summary
To address the challenge of clustering high-dimensional, sparse time-series data with frequent missing values from large-scale distributed rooftop photovoltaic (PV) systems, this paper proposes the first probabilistic embedding clustering framework tailored for PV systems. It jointly encodes each system’s power generation pattern and its associated uncertainty as a probability distribution, measures similarity between distributions via the Wasserstein distance, and applies hierarchical clustering for robust grouping. By explicitly modeling uncertainty, the method yields interpretable cluster-level representations that are resilient to missing data, overcoming key limitations of deterministic clustering in high-noise settings. Evaluated on multi-year residential PV data, the framework outperforms a physics-based baseline: clusters exhibit higher representativeness and robustness (measured by silhouette score), and the learned embeddings enable accurate imputation of missing values. A comprehensive hyperparameter analysis further provides practical guidelines for balancing performance and robustness.
📝 Abstract
As the number of rooftop photovoltaic (PV) installations increases, aggregators and system operators are required to monitor and analyze these systems, which raises the challenge of integrating and managing large, spatially distributed time-series data that are both high-dimensional and affected by missing values. In this work, a probabilistic entity embedding-based clustering framework is proposed to address these problems. The method encodes each PV system's characteristic power generation patterns and uncertainty as a probability distribution, then groups systems by applying agglomerative clustering to the statistical distances between these distributions. Applied to a multi-year residential PV dataset, it produces concise, uncertainty-aware cluster profiles that outperform a physics-based baseline in representativeness and robustness, and support reliable missing-value imputation. A systematic hyperparameter study further offers practical guidance for balancing model performance and robustness.
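The distance-then-cluster step described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes each PV system's embedding is a diagonal Gaussian (a mean vector and a standard-deviation vector), uses the closed-form 2-Wasserstein distance that holds for that case, and picks average linkage arbitrarily. The function names `gaussian_w2` and `cluster_embeddings` and the toy embeddings are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def gaussian_w2(mu_a, sigma_a, mu_b, sigma_b):
    """2-Wasserstein distance between two diagonal Gaussians.

    For diagonal covariances the closed form is
    W2^2 = ||mu_a - mu_b||^2 + ||sigma_a - sigma_b||^2,
    where sigma_* are vectors of standard deviations.
    """
    return np.sqrt(np.sum((mu_a - mu_b) ** 2) + np.sum((sigma_a - sigma_b) ** 2))


def cluster_embeddings(mus, sigmas, n_clusters):
    """Agglomerative clustering on pairwise Wasserstein distances."""
    n = len(mus)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = gaussian_w2(mus[i], sigmas[i], mus[j], sigmas[j])
    # linkage expects a condensed distance vector; average linkage is an
    # assumption here, the source does not name a linkage criterion.
    tree = linkage(squareform(dist), method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Toy embeddings for four hypothetical PV systems (two obvious groups).
mus = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
sigmas = np.array([[0.5, 0.5], [0.5, 0.6], [1.0, 1.0], [1.0, 1.1]])
labels = cluster_embeddings(mus, sigmas, n_clusters=2)
```

Because the distance includes the standard-deviation term, two systems with similar mean patterns but very different uncertainty are pushed apart, which is the property a deterministic embedding with a plain Euclidean distance would miss.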