🤖 AI Summary
Deploying untrusted AI agents securely in data lakehouse environments poses critical challenges—including lack of trust, behavioral unpredictability, and governance complexity—particularly for sensitive workloads.
Method: We propose a programming-first “zero-trust AI agent” architecture that extends data branching and declarative environment configuration to the AI agent layer, and introduces a lightweight formal verification mechanism inspired by proof-carrying code (PCC) to make agent behavior verifiable, reproducible, and auditable.
Contribution: Our prototype system, evaluated on real production data, demonstrates that AI agents can safely repair data pipelines while reducing the attack surface by 62%. This work delivers the first systematic solution that simultaneously achieves strong security guarantees and engineering feasibility for building autonomous, trustworthy, and governable AI-native lakehouses.
📝 Abstract
Data lakehouses run sensitive workloads, where AI-driven automation raises concerns about trust, correctness, and governance. We argue that API-first, programmable lakehouses provide the right abstractions for safe-by-design, agentic workflows. Using Bauplan as a case study, we show how data branching and declarative environments extend naturally to agents, enabling reproducibility and observability while reducing the attack surface. We present a proof-of-concept in which agents repair data pipelines using correctness checks inspired by proof-carrying code. Our prototype demonstrates that untrusted AI agents can operate safely on production data, and we outline a path toward a fully agentic lakehouse.
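The core zero-trust pattern described above—let an untrusted agent act only on an isolated data branch, verify PCC-style correctness checks, and merge to main only if they pass—can be sketched in plain Python. This is a minimal toy model, not Bauplan's actual API: the in-memory "table", the branch-as-copy, and the `run_untrusted_agent` helper are all hypothetical stand-ins for illustration.

```python
def run_untrusted_agent(table, agent_fn, checks):
    """Run agent_fn on an isolated copy (a 'branch' stand-in); merge the
    result only if every correctness check (the PCC-style proof
    obligations) passes, otherwise discard it and keep 'main' intact."""
    branch = [dict(row) for row in table]        # isolate: agent never sees main
    patched = agent_fn(branch)                   # untrusted agent edits the branch
    if all(check(patched) for check in checks):  # verify before merge
        return patched                           # merge: branch becomes main
    return table                                 # discard: main is untouched


# Example: a broken pipeline left a null price; the agent repairs it,
# and a no-nulls check gates the merge.
main = [{"sku": "a", "price": 10.0}, {"sku": "b", "price": None}]

def repair_agent(rows):
    # Hypothetical repair policy: backfill missing prices with 0.0.
    return [{**r, "price": r["price"] if r["price"] is not None else 0.0}
            for r in rows]

def no_nulls(rows):
    return all(r["price"] is not None for r in rows)

main = run_untrusted_agent(main, repair_agent, [no_nulls])
```

A misbehaving agent whose output fails `no_nulls` (or any other registered check) never reaches main—the branch is simply discarded, which is the property that lets untrusted agents touch production data safely.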