Zero-Scan Data Quality: Leveraging Table Format Metadata for Continuous Observability at Scale

📅 2026-05-28
🤖 AI Summary
This work addresses the high computational cost and poor scalability of traditional data quality monitoring, which relies on full-table scans. The authors propose a zero-scan approach that repurposes the statistics Apache Iceberg already records at write time, such as commit timestamps, record counts, null counts, and value bounds, for anomaly detection, distribution drift monitoring, and null-rate tracking without reading the underlying data. Deployed at LinkedIn across more than 200,000 Iceberg tables (800+ PB), the method satisfies approximately 60% of user-defined rules at zero marginal compute cost and reduces profiling resource consumption by around 50%. Extending the metadata with lightweight counters and incrementally mergeable sketches (Theta sketches for distinct counts, KLL sketches for quantiles), which the authors propose storing per file in Puffin sidecar files, could raise coverage to close to 90% of rules.
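To illustrate what "mergeable" means for a distinct-count sketch, here is a minimal sketch in pure Python. It uses a k-minimum-values (KMV) sketch, a simpler relative of the Theta sketch, as a stand-in. The class name `KMVSketch`, the choice of `k`, and the SHA-256 hashing are assumptions made for this illustration; they are not the paper's implementation or the Apache DataSketches API.

```python
import bisect
import hashlib


class KMVSketch:
    """Illustrative k-minimum-values sketch (a stand-in for a Theta sketch)."""

    def __init__(self, k=256, values=()):
        self.k = k
        # Keep only the k smallest distinct hash values, in sorted order.
        self.mins = sorted(set(values))[:k]

    @staticmethod
    def _hash(item):
        # Map an item to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha256(str(item).encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, item):
        h = self._hash(item)
        if len(self.mins) == self.k and h >= self.mins[-1]:
            return  # too large to be among the k smallest
        i = bisect.bisect_left(self.mins, h)
        if i < len(self.mins) and self.mins[i] == h:
            return  # already present
        self.mins.insert(i, h)
        del self.mins[self.k:]

    def merge(self, other):
        # The k smallest hashes of the union come from the two retained sets.
        return KMVSketch(self.k, set(self.mins) | set(other.mins))

    def estimate(self):
        if len(self.mins) < self.k:
            return float(len(self.mins))  # fewer than k distinct values: exact
        return (self.k - 1) / self.mins[-1]


# Two "files" with overlapping values, each sketched independently.
file_a, file_b, full = KMVSketch(), KMVSketch(), KMVSketch()
for v in range(0, 3000):
    file_a.update(v)
for v in range(2000, 5000):
    file_b.update(v)
for v in range(0, 5000):
    full.update(v)

merged = file_a.merge(file_b)
```

Merging the two per-file sketches yields the same state as sketching all values in one pass, which is the property that lets per-file sketches be combined into a table-level estimate without rereading data.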
📝 Abstract
Modern table formats such as Apache Iceberg compute and store metadata (commit timestamps, record counts, and column-level statistics such as null counts and value bounds) at write time as part of file writing. These statistics serve query planning, yet they overlap substantially with data quality (DQ) monitoring needs. We describe a metadata-first approach that repurposes write-time statistics for continuous DQ observability (anomaly detection, drift monitoring, and null-rate tracking) without scanning any data. Deployed at LinkedIn across 200,000+ Iceberg tables (800+ PB), this approach satisfies approximately 60% of user-defined DQ rules at zero marginal compute cost and reduces profiling resource consumption by around 50%. Extending manifest statistics with lightweight counters (sum, zero-value counts, boolean counts) and incrementally mergeable sketches (Theta sketches for distinct counts, KLL sketches for quantiles) can further raise metadata-satisfiable coverage to close to 90% of production DQ rules. We validate sketch accuracy, mergeability, and storage overhead on production data and propose that table formats should store per-file sketches in Puffin sidecar files, following the same store-then-aggregate pattern used for existing manifest statistics.
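The store-then-aggregate pattern for existing manifest statistics can be sketched as follows. This is a minimal illustration, assuming per-file statistics have already been read from the table metadata into plain Python objects. The names `FileStats`, `ColumnProfile`, and `aggregate` are hypothetical and do not correspond to any Iceberg API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FileStats:
    """Hypothetical per-file statistics, modeled loosely on manifest entries."""
    record_count: int
    null_counts: Dict[str, int]
    lower_bounds: Dict[str, float]
    upper_bounds: Dict[str, float]


@dataclass
class ColumnProfile:
    null_rate: float
    lower: Optional[float]
    upper: Optional[float]


def aggregate(files: List[FileStats], column: str) -> ColumnProfile:
    """Combine per-file statistics into a table-level profile, reading no data."""
    total = sum(f.record_count for f in files)
    nulls = sum(f.null_counts.get(column, 0) for f in files)
    lows = [f.lower_bounds[column] for f in files if column in f.lower_bounds]
    highs = [f.upper_bounds[column] for f in files if column in f.upper_bounds]
    return ColumnProfile(
        null_rate=nulls / total if total else 0.0,
        lower=min(lows) if lows else None,
        upper=max(highs) if highs else None,
    )


# Illustrative values only; real numbers would come from the table's metadata.
files = [
    FileStats(1000, {"price": 10}, {"price": 0.5}, {"price": 99.0}),
    FileStats(3000, {"price": 90}, {"price": 1.0}, {"price": 250.0}),
]
profile = aggregate(files, "price")
```

A null-rate rule such as "at most 5% of `price` values are null" can then be checked against `profile.null_rate`, and a range rule against `profile.lower` and `profile.upper`, using only the aggregated statistics.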
Problem

Research questions and friction points this paper is trying to address.

data quality
metadata
table formats
observability
zero-scan
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-Scan
Metadata-Driven Observability
Data Quality Monitoring
Sketch-Based Statistics
Table Format Optimization