🤖 AI Summary
Running composable data lakehouse (DLH) workloads in Function-as-a-Service (FaaS) environments poses two challenges: the scheduling problem is difficult to model, and evaluating scheduling policies empirically on real infrastructure is costly.
Method: This paper introduces the first lightweight, deterministic FaaS scheduling simulation framework tailored to lakehouse scenarios. Built upon an event-driven simulation paradigm, it features an abstracted FaaS execution model, a pluggable scheduling policy interface, and fine-grained lakehouse workload modeling capabilities.
Contribution/Results: Compared to conventional cloud-based simulation approaches, our framework substantially reduces algorithm iteration and infrastructure adaptation overhead. It enables efficient, reproducible scheduling evaluation of diverse real-world lakehouse tasks—including ETL, ad-hoc queries, and streaming-batch hybrid jobs—within a unified function runtime. Its core innovation lies in the first deep integration of deterministic simulation with lakehouse-specific scheduling requirements, establishing an extensible, principled validation baseline for scheduling research in cloud-native data systems.
📝 Abstract
Due to the variety of its target use cases and the large API surface area to cover, a data lakehouse (DLH) is a natural candidate for a composable data system. Bauplan is a composable DLH built on "spare data parts" and a unified Function-as-a-Service (FaaS) runtime for SQL queries and Python pipelines. While FaaS simplifies both building and using the system, it introduces novel challenges in scheduling and optimization of data workloads. In this work, starting from the programming model of the composable DLH, we characterize the underlying scheduling problem and motivate simulations as an effective tool to iterate on the DLH. We then introduce and release to the community Eudoxia, a deterministic simulator for scheduling data workloads as cloud functions. We show that Eudoxia can simulate a wide range of workloads and enables highly customizable user implementations of scheduling algorithms, providing a cheap mechanism for developers to evaluate different scheduling algorithms against their infrastructure.
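To make the idea of a deterministic simulator with user-supplied scheduling algorithms concrete, here is a minimal sketch in Python. It is not Eudoxia's API: every name in it (`Job`, `fifo`, `shortest_job_first`, `make_workload`, `simulate`) is hypothetical, and the model is deliberately simplified (identical workers, integer ticks, no preemption, no memory limits). It only illustrates the general pattern of a seeded workload, an event-driven loop, and a pluggable policy function.

```python
import heapq
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Job:
    """Hypothetical unit of work, e.g. one function invocation."""
    job_id: int
    arrival: int   # tick at which the job is submitted
    duration: int  # ticks of work on a single worker


# A policy receives the waiting queue and returns the index of the job to run next.
Policy = Callable[[List[Job]], int]


def fifo(queue: List[Job]) -> int:
    # Earliest arrival first; job_id breaks ties so the choice is deterministic.
    return min(range(len(queue)), key=lambda i: (queue[i].arrival, queue[i].job_id))


def shortest_job_first(queue: List[Job]) -> int:
    # Shortest duration first; job_id breaks ties so the choice is deterministic.
    return min(range(len(queue)), key=lambda i: (queue[i].duration, queue[i].job_id))


def make_workload(seed: int, n_jobs: int) -> List[Job]:
    # A private, seeded RNG makes the generated workload reproducible.
    rng = random.Random(seed)
    jobs, t = [], 0
    for i in range(n_jobs):
        t += rng.randint(0, 3)
        jobs.append(Job(i, t, rng.randint(1, 10)))
    return jobs


def simulate(jobs: List[Job], policy: Policy, n_workers: int) -> float:
    """Run the workload to completion and return the mean latency (finish - arrival)."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    if not jobs:
        return 0.0
    pending = sorted(jobs, key=lambda j: (j.arrival, j.job_id))
    waiting: List[Job] = []
    running: list = []  # min-heap of (finish_time, job_id, arrival)
    free, idx, total_latency = n_workers, 0, 0
    while idx < len(pending) or waiting or running:
        # Jump straight to the next event: an arrival or a completion.
        candidates = []
        if idx < len(pending):
            candidates.append(pending[idx].arrival)
        if running:
            candidates.append(running[0][0])
        now = min(candidates)
        # Release workers whose jobs have finished.
        while running and running[0][0] <= now:
            finish, _, arrival = heapq.heappop(running)
            total_latency += finish - arrival
            free += 1
        # Admit jobs that have arrived.
        while idx < len(pending) and pending[idx].arrival <= now:
            waiting.append(pending[idx])
            idx += 1
        # Ask the policy which waiting job each free worker should take.
        while free > 0 and waiting:
            job = waiting.pop(policy(waiting))
            heapq.heappush(running, (now + job.duration, job.job_id, job.arrival))
            free -= 1
    return total_latency / len(jobs)
```

With three jobs submitted at tick 0 with durations 5, 1, and 1 on a single worker, `fifo` yields a mean latency of 6.0 and `shortest_job_first` yields 10/3. Because the workload comes from a seeded generator and every tie is broken explicitly, repeated runs with the same inputs return the same result, which is the property that makes policy comparisons of this kind reproducible.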