🤖 AI Summary
In AI-augmented data workflows, human-AI collaboration, tool heterogeneity, and opaque model decisions make comprehensive metadata capture difficult, which hinders data lineage tracing and reproducibility. To address this, we propose TableVault, a metadata governance framework that combines database-inspired guarantees with AI-oriented design. TableVault introduces declarative operation builders, lineage-aware references, and operation-status tracking, and links execution parameters to their data origins, so that lineage and operational context are preserved across heterogeneous tools and in partially observable execution environments. It also exposes a standardized metadata layer that covers data as it is generated, transformed, and consumed. In a document classification case study, TableVault preserves detailed lineage and operational context for a mixed human-model pipeline, supporting transparency and reproducibility.
📝 Abstract
AI-augmented data workflows introduce complex governance challenges, as both human and model-driven processes generate, transform, and consume data artifacts. These workflows blend heterogeneous tools, dynamic execution patterns, and opaque model decisions, making comprehensive metadata capture difficult. In this work, we present TableVault, a metadata governance framework designed for human-AI collaborative data creation. TableVault records ingestion events, traces operation status, links execution parameters to their data origins, and exposes a standardized metadata layer. By combining database-inspired guarantees with AI-oriented design, such as declarative operation builders and lineage-aware references, TableVault supports transparency and reproducibility across mixed human-model pipelines. Through a document classification case study, we demonstrate how TableVault preserves detailed lineage and operational context, enabling robust metadata management, even in partially observable execution environments.
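To make the ideas of lineage-aware references and operation-status tracking concrete, the following is a minimal illustrative sketch in Python. It is not TableVault's API: the names `MetadataLog`, `OperationRecord`, `Status`, and the `ref:` prefix convention are all hypothetical, chosen only to show how recording parameters that point at upstream artifacts could make lineage recoverable after the fact.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class OperationRecord:
    """One logged operation (hypothetical structure, not TableVault's schema)."""
    name: str
    params: dict   # execution parameters; "ref:<artifact>" strings mark data origins
    output: str    # name of the artifact this operation produces
    status: Status = Status.PENDING

    def inputs(self):
        """Artifacts this operation reads, recovered from its reference parameters."""
        return [v[len("ref:"):] for v in self.params.values()
                if isinstance(v, str) and v.startswith("ref:")]


@dataclass
class MetadataLog:
    ingested: set = field(default_factory=set)
    operations: dict = field(default_factory=dict)  # output artifact -> record

    def ingest(self, artifact):
        """Record that an external artifact entered the workflow."""
        self.ingested.add(artifact)

    def run(self, record, fn):
        """Execute fn, keeping the record and its final status even on failure."""
        self.operations[record.output] = record
        try:
            fn()
            record.status = Status.SUCCEEDED
        except Exception:
            record.status = Status.FAILED
        return record.status

    def lineage(self, artifact):
        """Return every upstream artifact that `artifact` depends on."""
        seen, stack = [], [artifact]
        while stack:
            record = self.operations.get(stack.pop())
            if record is None:  # ingested or unknown artifact: no recorded parents
                continue
            for parent in record.inputs():
                if parent not in seen:
                    seen.append(parent)
                    stack.append(parent)
        return seen


# Toy document classification pipeline; the operations themselves are stubs.
log = MetadataLog()
log.ingest("raw_docs")
log.run(OperationRecord("extract_text", {"source": "ref:raw_docs"}, "texts"),
        lambda: None)
log.run(OperationRecord("classify",
                        {"input": "ref:texts", "model": "some-llm", "temperature": 0.0},
                        "labels"),
        lambda: None)
print(log.lineage("labels"))
```

In this sketch, the lineage of `labels` is reconstructed purely from the logged parameters, and a failed operation still leaves a record with its status, which is the kind of operational context the abstract describes preserving.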