Building a Correct-by-Design Lakehouse: Data Contracts, Versioning, and Transactional Pipelines for Humans and Agents

📅 2026-02-02
🤖 AI Summary
This work addresses safety issues in lakehouse systems that arise when multiple untrusted actors operate concurrently on production data: upstream-downstream mismatches that surface only at runtime, and multi-table pipelines that leak partial effects. The authors propose Bauplan, a code-first lakehouse that brings software engineering principles to the lakehouse through typed table contracts, Git-like data versioning, and transactional runs with pipeline-level atomicity. The design aims to make most illegal states unrepresentable. The paper also reports early results from a lightweight formal transaction model and discusses future work motivated by counterexamples.

📝 Abstract
Lakehouses are the default cloud platform for analytics and AI, but they become unsafe when untrusted actors concurrently operate on production data: upstream-downstream mismatches surface only at runtime, and multi-table pipelines can leak partial effects. Inspired by software engineering, we design Bauplan, a code-first lakehouse that aims to make (most) illegal states unrepresentable using familiar abstractions. Bauplan acts along three axes: typed table contracts to make pipeline boundaries checkable, Git-like data versioning for review and reproducibility, and transactional runs that guarantee pipeline-level atomicity. We report early results from a lightweight formal transaction model and discuss future work motivated by counterexamples.
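The two guarantees named in the abstract, checkable pipeline boundaries and pipeline-level atomicity, can be illustrated with a small in-memory sketch. This is not Bauplan's API: `ToyLakehouse`, `check_contract`, and `run` are hypothetical names, tables are plain Python lists of dicts, and the "branch" is a deep copy rather than real data versioning. The sketch assumes each pipeline step is a function that reads the branch and returns the tables it produces.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]
Schema = Dict[str, type]        # column name -> expected Python type
Tables = Dict[str, List[Row]]   # table name -> rows
Step = Callable[[Tables], Tables]


class ContractError(Exception):
    """Raised when a produced table violates its declared contract."""


def check_contract(table: str, rows: List[Row], schema: Schema) -> None:
    """Verify that every row has exactly the declared columns and types."""
    for i, row in enumerate(rows):
        if set(row) != set(schema):
            raise ContractError(
                f"{table}[{i}]: columns {sorted(row)} != {sorted(schema)}"
            )
        for col, expected in schema.items():
            if not isinstance(row[col], expected):
                raise ContractError(
                    f"{table}[{i}].{col}: expected {expected.__name__}"
                )


@dataclass
class ToyLakehouse:
    """Hypothetical stand-in for a lakehouse with a single 'main' branch."""

    main: Tables = field(default_factory=dict)

    def run(self, steps: List[Step], contracts: Dict[str, Schema]) -> bool:
        """Run all steps on a private branch; publish all outputs or none."""
        # Readers of `main` never observe this copy while the run is in flight.
        branch = copy.deepcopy(self.main)
        try:
            for step in steps:
                for name, rows in step(branch).items():
                    if name not in contracts:
                        raise ContractError(f"{name}: no contract declared")
                    check_contract(name, rows, contracts[name])
                    branch[name] = rows
        except ContractError:
            return False  # branch is discarded; main is unchanged
        # Publishing is one reference swap, so no partial state is visible.
        self.main = branch
        return True


# Hypothetical two-step pipeline over an existing `orders` table.
def clean(tables: Tables) -> Tables:
    return {"clean_orders": [dict(r) for r in tables["orders"]]}


def bad_totals(tables: Tables) -> Tables:
    return {"totals": [{"total": "oops"}]}  # wrong type on purpose


def good_totals(tables: Tables) -> Tables:
    total = sum(r["amount"] for r in tables["clean_orders"])
    return {"totals": [{"total": total}]}


contracts = {
    "clean_orders": {"id": int, "amount": float},
    "totals": {"total": float},
}
lake = ToyLakehouse(main={"orders": [{"id": 1, "amount": 10.0}]})
failed = lake.run([clean, bad_totals], contracts)
tables_after_failure = sorted(lake.main)
succeeded = lake.run([clean, good_totals], contracts)
```

In the failed run, `clean_orders` passes its contract but is never published, because the later `totals` step fails and the whole branch is dropped. A real system would also need to handle concurrent writers to `main`, which this sketch ignores.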
Problem

Research questions and friction points this paper is trying to address.

Lakehouse
Data Contracts
Transactional Pipelines
Data Versioning
Concurrency
Innovation

Methods, ideas, or system contributions that make the work stand out.

data contracts
data versioning
transactional pipelines
correct-by-design
lakehouse
Weiming Sheng
Columbia University
Jinlang Wang
University of Wisconsin-Madison
Manuel Barros
Carnegie Mellon University
Aldrin Montana
Bauplan Labs
Jacopo Tagliabue
NYU
Artificial Intelligence, NLP, Cognitive Sciences
Luca Bigon
Bauplan Labs