🤖 AI Summary
This work addresses the challenges of data provenance in streaming systems, namely high computational overhead, substantial storage costs, and limited scalability. These challenges have restricted existing approaches primarily to debugging scenarios and hindered broader data management applications. To overcome these limitations, the study introduces temporal interaction networks (TINs) into the provenance domain for the first time and proposes an efficient time-aware provenance framework tailored for data management. It formally defines two classes of data (discrete and liquid) and five types of temporal provenance queries, accompanied by a state-based indexing mechanism. Experimental evaluations on Apache Flink demonstrate significant reductions in both storage and computational costs, while case studies in transportation and finance underscore the framework’s scalability and practical utility, thereby extending the applicability and performance boundaries of traditional provenance techniques.
📝 Abstract
Data provenance (the process of determining the origin and derivation of data outputs) has applications across multiple domains, including explaining database query results and auditing scientific workflows. Despite decades of research, provenance tracing remains challenging due to computational costs and storage overhead. In streaming systems such as Apache Flink, provenance graphs can grow super-linearly with data volume, posing significant scalability challenges. Temporal provenance, which attaches timestamps to provenance information, is a promising direction: it enables time-focused queries without maintaining complete historical records. However, existing temporal provenance methods primarily focus on system-level debugging, leaving a gap in data management applications. This paper proposes an agenda that uses Temporal Interaction Networks (TINs) to represent temporal provenance efficiently. We demonstrate TINs' applicability across streaming systems, transportation networks, and financial networks. We classify data into discrete and liquid types, define five temporal provenance query types, and propose a state-based indexing approach. Our vision outlines research directions toward making temporal provenance a practical tool for large-scale dataflows.
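
To make the idea of a time-focused provenance query concrete, the following is a minimal sketch, not the paper's method. It assumes a TIN can be modeled as a list of timestamped, quantity-carrying interactions between vertices, and it answers one illustrative question: which vertices could have contributed data to a target within a time window? The names `Interaction` and `origins_in_window` are hypothetical, the query is not claimed to be one of the paper's five query types, and no state-based index is used. The sketch also assumes all timestamps are distinct.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One timestamped transfer in a (hypothetical) temporal interaction network."""
    src: str
    dst: str
    time: int
    quantity: float


def origins_in_window(interactions, target, t_start, t_end):
    """Return vertices that can reach `target` via a time-respecting path
    using only interactions with t_start <= time <= t_end.

    Assumes distinct timestamps. Scanning from latest to earliest means
    that when an interaction into vertex v is examined, every later
    interaction leaving v has already been considered.
    """
    reached = {target}
    for it in sorted(interactions, key=lambda i: i.time, reverse=True):
        if t_start <= it.time <= t_end and it.dst in reached:
            reached.add(it.src)
    return reached - {target}


# Illustrative data: d sends to b only after b has already forwarded to c,
# so d cannot be an origin of what arrives at e.
interactions = [
    Interaction("a", "b", 1, 5.0),
    Interaction("b", "c", 2, 3.0),
    Interaction("d", "b", 3, 2.0),
    Interaction("c", "e", 4, 3.0),
]

full_window = origins_in_window(interactions, "e", 1, 4)
late_window = origins_in_window(interactions, "e", 2, 4)
print(sorted(full_window))  # ['a', 'b', 'c']
print(sorted(late_window))  # ['b', 'c']
```

Restricting the window bounds how much history a query has to touch, which is the property the abstract attributes to temporal provenance. A practical system would replace the full sort and scan with an index, such as the state-based approach the paper proposes.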