🤖 AI Summary
This work addresses the problem that user-defined functions (UDFs) in modern data pipelines block traditional predicate pushdown, because moving a filter ahead of a UDF requires semantic reasoning about what the UDF computes. The authors propose a general semantic framework for predicate pushdown over stateful fold-based computations, in which correctness is witnessed by a bisimulation invariant relating the internal states of the original and optimized programs. Building on this, they give a sound and relatively complete verification framework and a synthesis algorithm that automatically constructs optimal decompositions, pairing the strongest admissible pre-filter with the weakest residual post-filter. The approach is implemented in a tool called Pusharoo and evaluated on 150 real-world pandas and Spark pipelines, where it achieves an average end-to-end speedup of 2.4×, with peak improvements reaching two orders of magnitude, at a median synthesis time of 1.6 seconds per benchmark.
📝 Abstract
Predicate pushdown is a long-standing performance optimization that filters data as early as possible in a computational workflow. In modern data pipelines, this transformation is especially important because much of the computation occurs inside user-defined functions (UDFs) written in general-purpose languages such as Python and Scala. These UDFs capture rich domain logic and complex aggregations and are among the most expensive operations in a pipeline. Moving filters ahead of such UDFs can yield substantial performance gains, but doing so requires semantic reasoning. This paper introduces a general semantic foundation for predicate pushdown over stateful fold-based computations.
We view pushdown as a correspondence between two programs that process different subsets of input data, with correctness witnessed by a bisimulation invariant relating their internal states. Building on this foundation, we develop a sound and relatively complete framework for verification, alongside a synthesis algorithm that automatically constructs optimal pushdown decompositions by finding the strongest admissible pre-filters and weakest residual post-filters. We implement this approach in a tool called Pusharoo and evaluate it on 150 real-world pandas and Spark data-processing pipelines. Our evaluation shows that Pusharoo is significantly more expressive than prior work, producing optimal pushdown transformations with a median synthesis time of 1.6 seconds per benchmark. Furthermore, our experiments demonstrate that the discovered pushdown optimizations speed up end-to-end execution by an average of 2.4× and up to two orders of magnitude.
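To make the idea of a pushdown decomposition concrete, the following is a minimal hand-written sketch in plain Python. It is not Pusharoo's output or algorithm. The fold (a running maximum per key), the threshold, and all names (`step`, `pre_filter`, `residual_post_filter`, `invariant_holds`) are hypothetical choices made for illustration. The original program folds over every row and then filters the result. The pushed-down program filters rows first, folds over the survivors, and applies a residual post-filter. A lockstep check then tests a candidate invariant relating the two fold states on one input, which only hints at the kind of relation a bisimulation invariant expresses and proves nothing in general.

```python
from functools import reduce

THRESHOLD = 100  # hypothetical constant used by the downstream filter


def step(state, row):
    """One fold step: track the running maximum amount seen per key."""
    key, amount = row
    return {**state, key: max(state.get(key, float("-inf")), amount)}


def original(rows):
    """Fold over all rows, then keep only keys whose maximum exceeds THRESHOLD."""
    state = reduce(step, rows, {})
    return {k: v for k, v in state.items() if v > THRESHOLD}


def pre_filter(row):
    """Candidate pre-filter: rows at or below THRESHOLD cannot affect the output."""
    return row[1] > THRESHOLD


def residual_post_filter(item):
    """Residual post-filter: trivially true in this toy example."""
    return True


def pushed_down(rows):
    """Filter first, fold over the surviving rows, then apply the residual filter."""
    state = reduce(step, (r for r in rows if pre_filter(r)), {})
    return {k: v for k, v in state.items() if residual_post_filter((k, v))}


def invariant_holds(rows):
    """Run both folds in lockstep and test the candidate invariant after each row:
    the pruned state equals the full state restricted to entries above THRESHOLD."""
    full, pruned = {}, {}
    for row in rows:
        full = step(full, row)
        if pre_filter(row):
            pruned = step(pruned, row)
        if pruned != {k: v for k, v in full.items() if v > THRESHOLD}:
            return False
    return True


rows = [("a", 50), ("a", 150), ("b", 20), ("c", 300), ("b", 90)]
print(original(rows) == pushed_down(rows), invariant_holds(rows))
```

In this sketch the pre-filter discards three of the five rows before the fold runs, and both programs return the same result on the sample input. The residual post-filter is trivial only because the running maximum makes the pre-filter exact. For other folds the residual check would generally have to do real work.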