🤖 AI Summary
Existing PF/OPF datasets suffer from three key limitations: (1) unrealistic perturbation modeling, with no realistic temporal load variations or N-k topological contingencies; (2) PF samples confined to the feasible region, omitting critical constraint violations (e.g., line overloads and voltage limit violations); and (3) fixed OPF cost functions, which limit generalizability. To address these, we introduce gridfm-datakit, the first open-source Python library supporting large-scale power systems (up to 10,000 buses). It provides a unified data generation pipeline that integrates realistic load scaling, localized noise injection, and arbitrary N-k topology perturbations. It systematically produces comprehensive PF samples, including all types of constraint violations, and diverse OPF datasets with multiple cost functions. Implemented with high-performance parallel computation on PyPower/Pandapower, the framework achieves over 3× greater scenario diversity and 100% coverage of violation states compared to tools like OPFData, significantly enhancing the robustness and generalization of ML-based OPF solvers.
📝 Abstract
We introduce gridfm-datakit-v1, a Python library for generating realistic and diverse Power Flow (PF) and Optimal Power Flow (OPF) datasets for training Machine Learning (ML) solvers. Existing datasets and libraries face three main challenges: (1) lack of realistic stochastic load and topology perturbations, limiting scenario diversity; (2) PF datasets are restricted to OPF-feasible points, hindering generalization of ML solvers to cases that violate operating limits (e.g., branch overloads or voltage violations); and (3) OPF datasets use fixed generator cost functions, limiting generalization across varying costs. gridfm-datakit addresses these challenges by: (1) combining global load scaling from real-world profiles with localized noise and supporting arbitrary N-k topology perturbations to create diverse yet realistic datasets; (2) generating PF samples beyond operating limits; and (3) producing OPF data with varying generator costs. It also scales efficiently to large grids (up to 10,000 buses). Comparisons with OPFData, OPF-Learn, PGLearn, and PFΔ are provided. Available on GitHub at https://github.com/gridfm/gridfm-datakit under Apache 2.0 and via `pip install gridfm-datakit`.
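The three perturbation ideas named in the abstract (global load scaling plus localized noise, N-k topology perturbations, and varying generator costs) can be illustrated with a minimal standard-library sketch. This is not the gridfm-datakit API: the function names `perturb_loads`, `sample_n_minus_k`, and `perturb_costs`, the Gaussian noise model, the uniform cost multiplier, and all numeric values are assumptions chosen only for illustration.

```python
import random


def perturb_loads(base_loads, global_scale, noise_sigma, rng):
    """Scale all loads by one shared factor, then apply independent per-bus noise.

    Assumption: multiplicative Gaussian noise, clipped so loads stay non-negative.
    """
    return [
        p * global_scale * max(0.0, 1.0 + rng.gauss(0.0, noise_sigma))
        for p in base_loads
    ]


def sample_n_minus_k(branch_ids, k, rng):
    """Pick k distinct branches to take out of service (an N-k contingency)."""
    outaged = set(rng.sample(branch_ids, k))
    in_service = [b for b in branch_ids if b not in outaged]
    return in_service, sorted(outaged)


def perturb_costs(linear_costs, rel_range, rng):
    """Multiply each generator's linear cost by a factor in [1 - r, 1 + r]."""
    return [c * rng.uniform(1.0 - rel_range, 1.0 + rel_range) for c in linear_costs]


rng = random.Random(0)            # fixed seed for reproducibility
base_loads = [50.0, 120.0, 80.0]  # MW per bus (made-up values)
profile = [0.7, 0.85, 1.0, 0.9]   # hypothetical normalized load profile
branches = list(range(6))         # hypothetical branch indices
costs = [20.0, 35.0]              # hypothetical linear cost coefficients

scenarios = []
for scale in profile:
    in_service, outaged = sample_n_minus_k(branches, 2, rng)
    scenarios.append({
        "loads": perturb_loads(base_loads, scale, 0.05, rng),
        "in_service": in_service,
        "outaged": outaged,
        "costs": perturb_costs(costs, 0.1, rng),
    })
```

Each entry in `scenarios` would then be passed to a PF or OPF solver. A practical generator would presumably also need to reject outage sets that split the network into islands; that check is omitted here.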