🤖 AI Summary
Privacy constraints hinder the sharing of real-world water distribution network (WDN) models, impeding the development of data-driven methods. To address this, we propose DiTEC-WDN, the first large-scale, publicly licensed synthetic WDN simulation dataset. It comprises 36,000 scenarios, each simulated over either a 24-hour or a one-year period, yielding 228 million graph-structured hydraulic states. Our methodology combines automated EPANET-based simulation, parameter optimization, graph-state encoding, and rule-based consistency verification, preserving hydraulic realism while removing the risk of exposing sensitive data. DiTEC-WDN supports tasks at several granularities: graph-level, node-level, and edge-level regression, as well as time-series forecasting. It has already enabled training and benchmarking of multiple AI models for water systems, filling a gap in publicly available benchmarks and advancing standardization in AI research for the water industry.
📝 Abstract
Privacy restrictions hinder the sharing of real-world Water Distribution Network (WDN) models, limiting the application of emerging data-driven machine learning, which typically requires extensive observations. To address this challenge, we present DiTEC-WDN, a dataset of 36,000 unique scenarios, each simulated over either a short-term (24 hours) or a long-term (1 year) period. We constructed the dataset with an automated pipeline that optimizes crucial parameters (e.g., pressure, flow rate, and demand patterns), facilitates large-scale simulations, and records discrete, synthetic but hydraulically realistic states under standard conditions via rule validation and post-hoc analysis. With a total of 228 million generated graph-based states, DiTEC-WDN supports a variety of machine-learning tasks, including graph-level, node-level, and link-level regression, as well as time-series forecasting. Released under a public license, this contribution encourages open scientific research in the critical water sector, eliminates the risk of exposing sensitive data, and fulfills the need for a large-scale water distribution network benchmark for comparative studies and scenario analysis.
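
To make the four task types concrete, the following is a minimal sketch of how one graph-based state and its learning targets might be represented. It is not the dataset's actual schema or loader: the `Snapshot` class, its field names, the mean-pressure graph target, and the toy values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Snapshot:
    """One hypothetical graph-based hydraulic state at a single time step."""
    node_pressure: Dict[str, float]            # node id -> pressure (illustrative)
    link_flow: Dict[Tuple[str, str], float]    # (from node, to node) -> flow rate


def node_targets(s: Snapshot) -> Dict[str, float]:
    """Node-level regression: one value per node."""
    return s.node_pressure


def link_targets(s: Snapshot) -> Dict[Tuple[str, str], float]:
    """Link-level regression: one value per pipe."""
    return s.link_flow


def graph_target(s: Snapshot) -> float:
    """Graph-level regression: one value per snapshot (assumed: mean pressure)."""
    return sum(s.node_pressure.values()) / len(s.node_pressure)


def forecasting_windows(
    series: List[Snapshot], history: int, horizon: int
) -> List[Tuple[List[Snapshot], List[Snapshot]]]:
    """Time-series forecasting: slide a (history, horizon) window over a scenario."""
    last_start = len(series) - history - horizon
    return [
        (series[i:i + history], series[i + history:i + history + horizon])
        for i in range(last_start + 1)
    ]


# Toy scenario: five time steps on a two-node, one-pipe network (made-up numbers).
scenario = [
    Snapshot({"J1": 30.0 + t, "J2": 28.0 + t}, {("J1", "J2"): 1.5})
    for t in range(5)
]
windows = forecasting_windows(scenario, history=3, horizon=1)
```

Each scenario is a sequence of such snapshots on a fixed network, so the same records can feed per-node, per-link, per-graph, or windowed forecasting models depending on which target function is applied.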