Fast Capture of Cell-Level Provenance in Numpy

📅 2025-06-22

🤖 AI Summary

Problem: Fine-grained data provenance tracking in NumPy array workflows is challenging due to rapid API evolution, heterogeneous operations, and large-scale data.

Method: This paper introduces the first cell-level dynamic data provenance system for NumPy. It employs a lightweight runtime annotation mechanism that captures operation-level provenance without modifying user code, and integrates memory-aware provenance compression to minimize overhead.

Contribution/Results: The system achieves an average runtime overhead of less than 8%, supports complete provenance reconstruction for complex chained array operations, and attains a provenance throughput of 12.4k operations per second on representative scientific computing workflows. It provides a high-accuracy, low-overhead, and scalable solution for reproducibility assurance and data governance in array-centric computational environments.

📝 Abstract

Effective provenance tracking enhances reproducibility, governance, and data quality in array workflows. However, significant challenges arise in capturing this provenance, including: (1) rapidly evolving APIs, (2) diverse operation types, and (3) large-scale datasets. To address these challenges, this paper presents a prototype annotation system designed for arrays, which captures cell-level provenance specifically within the numpy library. With this prototype, we explore straightforward memory optimizations that substantially reduce annotation latency. We envision this provenance capture approach for arrays as part of a broader governance system for tracking structured data workflows and diverse data science applications.
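
As a rough illustration of what cell-level provenance means for arrays, the sketch below pairs each value with the set of source cells it was derived from. It is not the paper's prototype: `ProvArray`, its `prov` field, and the `(name, index)` labels are hypothetical names chosen here, and the sketch assumes same-shape element-wise operations and single-axis sums only (no broadcasting).

```python
import numpy as np


class ProvArray:
    """Hypothetical wrapper pairing values with per-cell provenance sets."""

    def __init__(self, values, name=None, prov=None):
        self.values = np.asarray(values)
        if prov is None:
            # Each source cell starts with a singleton set naming itself.
            prov = np.empty(self.values.shape, dtype=object)
            for idx in np.ndindex(self.values.shape):
                prov[idx] = frozenset({(name, idx)})
        self.prov = prov

    def _binary(self, other, op):
        # Element-wise op: an output cell depends on the same cell of each input.
        values = op(self.values, other.values)
        prov = np.empty(values.shape, dtype=object)
        for idx in np.ndindex(values.shape):
            prov[idx] = self.prov[idx] | other.prov[idx]
        return ProvArray(values, prov=prov)

    def __add__(self, other):
        return self._binary(other, np.add)

    def __mul__(self, other):
        return self._binary(other, np.multiply)

    def sum(self, axis):
        # Reduction: an output cell depends on every input cell collapsed into it.
        values = self.values.sum(axis=axis)
        prov = np.empty(values.shape, dtype=object)
        moved = np.moveaxis(self.prov, axis, -1)
        for idx in np.ndindex(values.shape):
            prov[idx] = frozenset().union(*moved[idx])
        return ProvArray(values, prov=prov)


a = ProvArray([[1, 2], [3, 4]], name="a")
b = ProvArray([[10, 20], [30, 40]], name="b")
c = a + b          # c.prov[0, 1] names cell (0, 1) of both a and b
s = c.sum(axis=0)  # s.prov[0] names column 0 of both a and b
```

Storing one Python set per cell is simple to follow but expensive for large arrays, which is the kind of cost the abstract's memory optimizations are meant to reduce.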

Problem

Research questions and friction points this paper is trying to address.

Capturing cell-level provenance in numpy arrays

Addressing challenges of evolving APIs and diverse operations

Reducing annotation latency for large-scale datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cell-level provenance capture in numpy

Memory optimizations reduce annotation latency

Prototype annotation system for arrays
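
The highlights above do not say which memory optimizations the prototype uses. One generic possibility, shown here purely as an assumption, is to keep provenance in a compact integer array instead of per-cell Python objects: every source cell gets its flat index as an ID, and any operation that only moves cells (slicing, transposing, flipping) is applied to the ID array as well. The helper names `tag_source`, `apply_structural`, and `source_cell` are hypothetical.

```python
import numpy as np


def tag_source(values):
    """Return (values, ids), where ids holds each cell's flat index in the source."""
    values = np.asarray(values)
    ids = np.arange(values.size, dtype=np.int64).reshape(values.shape)
    return values, ids


def apply_structural(values, ids, op):
    """Apply a cell-moving op (slice, transpose, flip, ...) to values and ids alike."""
    return op(values), op(ids)


def source_cell(ids, out_index, source_shape):
    """Map an output cell back to the multi-dimensional index of its source cell."""
    return tuple(int(i) for i in np.unravel_index(ids[out_index], source_shape))


values, ids = tag_source(np.arange(12).reshape(3, 4) * 10)
# Transpose, take rows 1..2 of the result, and reverse the columns.
out_values, out_ids = apply_structural(values, ids, lambda x: x.T[1:3, ::-1])
origin = source_cell(out_ids, (0, 0), values.shape)
```

Because the ID array is an ordinary `int64` array, annotation is a vectorized `np.arange` call and structural operations on it produce views rather than copies. This covers only operations where each output cell has exactly one source cell; operations that combine cells would need a richer representation.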
