🤖 AI Summary
This work addresses safety problems that arise in lakehouse systems when untrusted actors operate concurrently on production data: upstream-downstream mismatches that surface only at runtime, and partial effects that leak from multi-table pipelines. The authors propose Bauplan, a code-first lakehouse that borrows from software engineering along three axes: typed table contracts that make pipeline boundaries checkable, Git-like data versioning for review and reproducibility, and transactional runs that guarantee pipeline-level atomicity. The design aims to make most illegal states unrepresentable. The paper reports early results from a lightweight formal transaction model and discusses future work motivated by the counterexamples it found.
📝 Abstract
Lakehouses are the default cloud platform for analytics and AI, but they become unsafe when untrusted actors concurrently operate on production data: upstream-downstream mismatches surface only at runtime, and multi-table pipelines can leak partial effects. Inspired by software engineering, we design Bauplan, a code-first lakehouse that aims to make (most) illegal states unrepresentable using familiar abstractions. Bauplan acts along three axes: typed table contracts to make pipeline boundaries checkable, Git-like data versioning for review and reproducibility, and transactional runs that guarantee pipeline-level atomicity. We report early results from a lightweight formal transaction model and discuss future work motivated by counterexamples.
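To make the three axes concrete, here is a minimal, self-contained Python sketch of how typed contracts, branch-style snapshots, and an all-or-nothing run could fit together. It is a toy in-memory model, not Bauplan's API: the names `Catalog`, `Step`, `conforms`, and `run` are hypothetical and chosen only for illustration, and the abstract does not specify how the real system implements any of this.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Schema = Dict[str, type]   # column name -> expected Python type (toy contract)
Table = Tuple[Row, ...]    # a table snapshot, treated as immutable


def conforms(table: Table, schema: Schema) -> bool:
    """Return True if every row has exactly the contracted columns and types."""
    return all(
        row.keys() == schema.keys()
        and all(isinstance(row[col], typ) for col, typ in schema.items())
        for row in table
    )


@dataclass
class Step:
    output: str                               # name of the table this step writes
    contract: Schema                          # typed contract for that table
    fn: Callable[[Dict[str, Table]], Table]   # reads current tables, returns a new one


@dataclass
class Catalog:
    # Each branch maps table names to snapshots (hypothetical, in-memory only).
    branches: Dict[str, Dict[str, Table]] = field(
        default_factory=lambda: {"main": {}}
    )

    def run(self, steps: List[Step], target: str = "main") -> bool:
        """Run all steps on a scratch copy; publish to target only if all pass."""
        scratch = dict(self.branches[target])  # shallow copy of the table pointers
        for step in steps:
            try:
                result = step.fn(scratch)
            except Exception:
                return False                   # step crashed: target is untouched
            if not conforms(result, step.contract):
                return False                   # contract violated: target is untouched
            scratch[step.output] = result
        self.branches[target] = scratch        # one reference swap publishes everything
        return True
```

In this sketch a run that fails at its second step leaves `main` exactly as it was, so a reader of `main` never observes the first step's output without the second. A real lakehouse would have to provide the same property over object storage and concurrent writers, which this toy does not attempt to model.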