🤖 AI Summary
Problem: Pandas struggles with datasets that exceed available memory, and migrating to scalable frameworks (e.g., Dask, Modin) requires extensive code rewriting. Method: This paper introduces the “Lazy Fat Panda Diet” paradigm, which combines just-in-time (JIT) static analysis with a lazy DataFrame wrapper to optimize programs written against the plain Pandas API. The optimization is backend-agnostic across Pandas, Dask, and Modin, and the user changes only two lines of code. The system supports lazy execution and runtime memory-aware query planning. Contribution/Results: Evaluated on real-world data science workloads, the approach achieves 2.1–5.8× end-to-end speedup over baseline Pandas, maintains correctness and stability even under memory pressure, and avoids costly framework migration. It reduces development and deployment overhead for large-scale analytics while preserving Pandas API compatibility.
📝 Abstract
Pandas is widely used for data science applications, but users often run into problems when datasets are larger than memory. Several frameworks based on lazy evaluation can handle large datasets, but programs have to be rewritten to suit each framework, and the presence of multiple frameworks complicates the programmer's life. In this paper we present a framework that allows programmers to code in plain Pandas. With just two lines of code changed by the user, our system optimizes the program using a combination of just-in-time static analysis and runtime optimization based on a lazy dataframe wrapper framework. Moreover, our system allows the programmer to choose the backend: it works seamlessly with Pandas, Dask, and Modin, so the best-suited backend for an application can be chosen based on factors such as data size. Performance results on a variety of programs show the benefits of our framework.
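To make the idea of a lazy dataframe wrapper concrete, the following is a minimal sketch of deferred execution only. It is not the paper's implementation and performs no static analysis or optimization. The names `LazyFrame`, `ToyFrame`, and `load` are hypothetical, and `ToyFrame` is a stand-in backend (mirroring the Pandas method names `sort_values` and `head`) so the example runs without any dataframe library installed.

```python
from typing import Any, Callable, Tuple


class LazyFrame:
    """Hypothetical lazy wrapper: records method calls instead of running them."""

    def __init__(self, loader: Callable[[], Any], ops: Tuple = ()):
        self._loader = loader  # deferred data source; called only in compute()
        self._ops = ops        # recorded (method_name, args, kwargs) triples

    def __getattr__(self, name: str):
        # Reached only for attributes not defined on LazyFrame itself,
        # i.e. methods that belong to the backend object.
        if name.startswith("_"):
            raise AttributeError(name)

        def record(*args, **kwargs):
            # Return a new wrapper with the call appended; nothing executes yet.
            return LazyFrame(self._loader, self._ops + ((name, args, kwargs),))

        return record

    def compute(self):
        """Load the data, then replay the recorded calls on the backend object."""
        frame = self._loader()
        for name, args, kwargs in self._ops:
            frame = getattr(frame, name)(*args, **kwargs)
        return frame


class ToyFrame:
    """Stand-in backend (assumption) built on a list of dicts."""

    def __init__(self, rows):
        self.rows = list(rows)

    def head(self, n):
        return ToyFrame(self.rows[:n])

    def sort_values(self, by):
        return ToyFrame(sorted(self.rows, key=lambda r: r[by]))


loads = []  # counts how many times the data source is actually read


def load():
    loads.append(1)
    return ToyFrame([{"x": 3}, {"x": 1}, {"x": 2}])


lazy = LazyFrame(load).sort_values("x").head(2)  # builds a plan; no load yet
result = lazy.compute()                          # load and replay happen here
```

Because the wrapper only forwards method names, the loader could in principle return any backend object that exposes the same methods. A real system would inspect the recorded plan before executing it, which is where optimizations such as those described in the abstract would apply.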