🤖 AI Summary
Apache Arrow’s “zero-copy” data sharing in FaaS-based data pipelines remains unrealized due to insufficient Linux kernel support for shared memory, forcing costly user-space memory mapping and intermediate data copies.
Method: This paper proposes Bauplan, a lakehouse architecture featuring a novel kernel-level de-anonymization module that bypasses user-space memory mapping constraints, enabling end-to-end zero-copy transmission of Arrow data across DAG nodes.
Contribution/Results: Bauplan achieves near 100% zero-copy rate in Arrow pipelines—eliminating all intermediate data copies for the first time. Experiments across diverse representative workloads show significant reductions in data transfer latency, over 98% lower memory bandwidth consumption, and up to 3.2× higher end-to-end throughput. By addressing a fundamental systems-level bottleneck, Bauplan delivers a practical, high-performance solution for serverless data processing.
📝 Abstract
Bauplan is a FaaS-based lakehouse specifically built for data pipelines: its execution engine uses Apache Arrow for data passing between the nodes in the DAG. While Arrow is known as the"zero copy format", in practice, limited Linux kernel support for shared memory makes it difficult to avoid copying entirely. In this work, we introduce several new techniques to eliminate nearly all copying from pipelines: in particular, we implement a new kernel module that performs de-anonymization, thus eliminating a copy to intermediate data. We conclude by sharing our preliminary evaluation on different workloads types, as well as discussing our plan for future improvements.