Building a Correct-by-Design Lakehouse. Data Contracts, Versioning, and Transactional Pipelines for Humans and Agents

📅 2026-02-02

🤖 AI Summary

This work addresses safety issues that arise in lakehouse systems when untrusted actors operate concurrently on production data: upstream-downstream mismatches that surface only at runtime, and multi-table pipelines that leak partial effects. The authors propose Bauplan, a code-first lakehouse that borrows software engineering principles through typed table contracts, Git-like data versioning, and transactional runs that guarantee pipeline-level atomicity. The design aims to make most illegal states unrepresentable. The paper also reports early results from a lightweight formal transaction model and discusses future work motivated by counterexamples.

📝 Abstract

Lakehouses are the default cloud platform for analytics and AI, but they become unsafe when untrusted actors concurrently operate on production data: upstream-downstream mismatches surface only at runtime, and multi-table pipelines can leak partial effects. Inspired by software engineering, we design Bauplan, a code-first lakehouse that aims to make (most) illegal states unrepresentable using familiar abstractions. Bauplan acts along three axes: typed table contracts to make pipeline boundaries checkable, Git-like data versioning for review and reproducibility, and transactional runs that guarantee pipeline-level atomicity. We report early results from a lightweight formal transaction model and discuss future work motivated by counterexamples.
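
The three axes named in the abstract can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions, not Bauplan's API: the names Contract, Step, Catalog, and run are hypothetical, tables are plain tuples of dicts, and the merge step ignores concurrent writers to main, which a real system would have to detect.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical toy model: none of these names come from Bauplan itself.


@dataclass
class Contract:
    columns: dict  # column name -> expected Python type

    def satisfied_by(self, upstream: "Contract") -> bool:
        # Every column this contract needs must exist upstream with the same type.
        return all(upstream.columns.get(c) is t for c, t in self.columns.items())

    def validate(self, rows) -> None:
        # Runtime check that produced rows really match the declared contract.
        for row in rows:
            for col, typ in self.columns.items():
                if not isinstance(row.get(col), typ):
                    raise TypeError(f"column {col!r} is not {typ.__name__}: {row!r}")


@dataclass
class Step:
    name: str                    # also the name of the table the step writes
    needs: Optional[Contract]    # contract on the previous step's output
    gives: Contract              # contract on this step's output
    fn: Callable[[dict], tuple]  # maps the branch's tables to new rows


class Catalog:
    """Branch name -> {table name -> rows}, loosely mimicking Git-like branches."""

    def __init__(self):
        self.branches = {"main": {}}

    def branch(self, name, source="main"):
        # Copy the table map; the row tuples themselves are shared, not copied.
        self.branches[name] = dict(self.branches[source])

    def merge(self, name, into="main"):
        # Publish every table of the run at once by swapping one reference.
        self.branches[into] = self.branches.pop(name)

    def drop(self, name):
        self.branches.pop(name, None)


def run(catalog, steps, run_id="run-1"):
    # 1. Check boundaries before any data is touched.
    for prev, nxt in zip(steps, steps[1:]):
        if nxt.needs is not None and not nxt.needs.satisfied_by(prev.gives):
            raise TypeError(f"{nxt.name} cannot consume the output of {prev.name}")
    # 2. Execute on an isolated branch so main never sees intermediate tables.
    catalog.branch(run_id)
    tables = catalog.branches[run_id]
    try:
        for step in steps:
            rows = step.fn(tables)
            step.gives.validate(rows)
            tables[step.name] = rows
    except Exception:
        catalog.drop(run_id)  # discard partial effects
        raise
    # 3. All steps succeeded: publish atomically.
    catalog.merge(run_id)


orders = Contract({"id": int, "amount": float})
totals = Contract({"total": float})

load = Step("orders", None, orders,
            lambda t: ({"id": 1, "amount": 9.5}, {"id": 2, "amount": 0.5}))
good = Step("totals", orders, totals,
            lambda t: ({"total": sum(r["amount"] for r in t["orders"])},))
bad = Step("totals", orders, totals, lambda t: ({"total": "n/a"},))

catalog = Catalog()
try:
    run(catalog, [load, bad])  # second step violates its output contract
except TypeError:
    pass
leaked = sorted(catalog.branches["main"])     # tables visible after the failed run

run(catalog, [load, good])
published = sorted(catalog.branches["main"])  # tables visible after the good run
```

In this sketch the failed run leaves main untouched even though its first step succeeded, while the successful run publishes both tables together. Declaring contracts on each step is what lets the boundary check happen before execution instead of surfacing as a runtime failure.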

Problem

Research questions and friction points this paper is trying to address.

Lakehouse

Data Contracts

Transactional Pipelines

Data Versioning

Concurrency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Data Contracts

Data Versioning

Transactional Pipelines