Zerrow: True Zero-Copy Arrow Pipelines in Bauplan

📅 2025-04-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Apache Arrow’s “zero-copy” data sharing remains unrealized in FaaS-based data pipelines because limited Linux kernel support for shared memory forces intermediate data copies. Method: Working within Bauplan, a FaaS-based lakehouse whose execution engine passes Arrow data between the nodes of a DAG, the paper introduces several techniques to eliminate copying, including a new kernel module that performs de-anonymization and thereby removes a copy to intermediate data. Contribution/Results: The approach eliminates nearly all copying from Arrow pipelines. A preliminary evaluation across different workload types reportedly shows lower data transfer latency, over 98% lower memory bandwidth consumption, and up to 3.2× higher end-to-end throughput. The work targets a systems-level bottleneck in serverless data processing.

📝 Abstract

Bauplan is a FaaS-based lakehouse specifically built for data pipelines: its execution engine uses Apache Arrow for data passing between the nodes in the DAG. While Arrow is known as the "zero copy format", in practice, limited Linux kernel support for shared memory makes it difficult to avoid copying entirely. In this work, we introduce several new techniques to eliminate nearly all copying from pipelines: in particular, we implement a new kernel module that performs de-anonymization, thus eliminating a copy to intermediate data. We conclude by sharing our preliminary evaluation on different workload types, as well as discussing our plan for future improvements.
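
To make the distinction between sharing and copying concrete, here is a minimal sketch in Python using only the standard library. It is not Bauplan's implementation, does not use the Arrow API, and does not model the kernel module described in the abstract. The function names (`produce_column`, `attach_column`, `demo`) are hypothetical, and the "column" is assumed to be a flat buffer of native 64-bit integers. One handle writes the buffer into a named shared-memory segment, a second handle attaches to it by name, and a write made through the first handle is visible through the second but not in an explicit copy.

```python
import struct
from multiprocessing import shared_memory

ITEM = struct.calcsize("q")  # bytes per native 64-bit integer


def produce_column(values):
    """Hypothetical producer: write int64 values into a new named segment."""
    shm = shared_memory.SharedMemory(create=True, size=ITEM * len(values))
    struct.pack_into(f"{len(values)}q", shm.buf, 0, *values)
    return shm


def attach_column(name, length):
    """Hypothetical consumer: attach by name and view the bytes as int64."""
    shm = shared_memory.SharedMemory(name=name)
    # Slice to the logical size first: some platforms round segments up
    # to a page multiple. The cast reinterprets the bytes without copying.
    view = shm.buf[: ITEM * length].cast("q")
    return shm, view


def demo():
    values = [10, 20, 30, 40]
    producer = produce_column(values)
    consumer, view = attach_column(producer.name, len(values))
    try:
        total = sum(view)      # read directly from shared memory
        copied = bytes(view)   # an explicit copy, for contrast
        # Overwrite the first element through the producer's handle.
        struct.pack_into("q", producer.buf, 0, 99)
        shared_sees_write = view[0] == 99
        copy_sees_write = struct.unpack_from("q", copied, 0)[0] == 99
        return total, shared_sees_write, copy_sees_write
    finally:
        view.release()         # views must be released before close()
        consumer.close()
        producer.close()
        producer.unlink()


if __name__ == "__main__":
    print(demo())
```

In this sketch the buffer lives in a named, shareable segment from the moment it is created, which is the easy case. Both handles are opened in one process here for brevity; attaching by name works the same way from a separate process.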

Problem

Research questions and friction points this paper is trying to address.

Eliminate data copying in Arrow pipelines

Implement kernel module for de-anonymization

Improve shared memory support in Linux

Innovation

Methods, ideas, or system contributions that make the work stand out.

New kernel module for de-anonymization

True zero-copy Arrow pipelines

FaaS-based lakehouse for data pipelines
