AI Summary
This work addresses the challenges posed by the massive and unstructured time-series data generated in industrial cyber-physical systems (CPS), where existing preprocessing approaches rely on ad hoc scripts that suffer from poor readability, reusability, and maintainability. To overcome these limitations, the authors propose and implement CPSLint, the first domain-specific language (DSL) tailored for industrial CPS data preprocessing. CPSLint abstracts common data cleaning and validation operations into a concise and expressive syntax, enabling cross-scenario reuse and significantly improving both data preparation efficiency and team collaboration. The DSL has been open-sourced, and experimental results demonstrate that complex preprocessing tasks can be accomplished in just a few lines of code, substantially reducing redundant development efforts. CPSLint thus establishes a scalable and standardized paradigm for industrial time-series data processing.
Abstract
Raw datasets are often too large and unstructured to work with directly, and require a data preparation phase. The domain of industrial Cyber-Physical Systems (CPSs) is no exception, as raw data typically consists of large time-series data collections that log the system's status at regular time intervals. Such raw data is usually processed with ad hoc, case-specific, one-off Python scripts that neglect readability, reusability, and maintainability. In practice, this leads professionals such as data scientists to write similar data preparation scripts for each case, which means much repetitive work. We introduce CPSLint, a Domain-Specific Language (DSL) designed to support the data preparation process for industrial CPSs. CPSLint raises the level of abstraction to the point where both data scientists and domain experts can perform the data preparation task. We leverage the fact that many raw data collections in the industrial CPS domain require similar actions to render them suitable for data-centric workflows. In our DSL, one can express the data preparation process in just a few lines of code. CPSLint is a publicly available tool applicable to any case involving time-series data collections in need of sanitisation.