Optimal Predicate Pushdown Synthesis

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge that user-defined functions (UDFs) in modern data pipelines impede traditional predicate pushdown optimizations. The authors propose the first semantic framework for predicate pushdown that supports stateful fold computations, formalizing filtering semantics through bisimulation invariants. They design an optimal decomposition algorithm that automatically synthesizes a combination of pre-filtering and residual post-filtering components. By integrating formal verification with program transformation, the approach enables safe and efficient UDF optimization on both pandas and Spark. Evaluated on 150 real-world data pipelines, the method achieves an average speedup of 2.4×, with peak improvements reaching two orders of magnitude, while maintaining a median synthesis time of only 1.6 seconds.

📝 Abstract

Predicate pushdown is a long-standing performance optimization that filters data as early as possible in a computational workflow. In modern data pipelines, this transformation is especially important because much of the computation occurs inside user-defined functions (UDFs) written in general-purpose languages such as Python and Scala. These UDFs capture rich domain logic and complex aggregations and are among the most expensive operations in a pipeline. Moving filters ahead of such UDFs can yield substantial performance gains, but doing so requires semantic reasoning. This paper introduces a general semantic foundation for predicate pushdown over stateful fold-based computations. We view pushdown as a correspondence between two programs that process different subsets of input data, with correctness witnessed by a bisimulation invariant relating their internal states. Building on this foundation, we develop a sound and relatively complete framework for verification, alongside a synthesis algorithm that automatically constructs optimal pushdown decompositions by finding the strongest admissible pre-filters and weakest residual post-filters. We implement this approach in a tool called Pusharoo and evaluate it on 150 real-world pandas and Spark data-processing pipelines. Our evaluation shows that Pusharoo is significantly more expressive than prior work, producing optimal pushdown transformations with a median synthesis time of 1.6 seconds per benchmark. Furthermore, our experiments demonstrate that the discovered pushdown optimizations speed up end-to-end execution by an average of 2.4× and up to two orders of magnitude.
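To make the idea of a pushdown decomposition concrete, the following is a minimal hand-written sketch in plain Python. It is not Pusharoo's algorithm or output; the function names (`fold_totals`, `pre_filter`, `residual`, and so on) and the sample data are hypothetical and chosen only for illustration. A fold accumulates per-key totals, and the original pipeline filters the fold's output. The rewritten pipeline filters input rows first and leaves only the part of the predicate that depends on the accumulated state as a residual post-filter.

```python
from functools import reduce


def fold_totals(rows):
    """Stateful fold: accumulate a running total per key."""
    def step(state, row):
        key, value = row
        state[key] = state.get(key, 0) + value
        return state
    return reduce(step, rows, {})


def post_filter(key, total):
    """Filter applied to the fold's output in the original pipeline."""
    return key.startswith("a") and total > 100


def original(rows):
    """Fold over all rows, then filter the results."""
    return {k: t for k, t in fold_totals(rows).items() if post_filter(k, t)}


def pre_filter(row):
    """Hypothetical pre-filter: the key test depends only on the input row."""
    return row[0].startswith("a")


def residual(key, total):
    """Hypothetical residual: the total test needs the folded state."""
    return total > 100


def pushed_down(rows):
    """Filter rows first, fold the survivors, then apply the residual."""
    kept = [r for r in rows if pre_filter(r)]
    return {k: t for k, t in fold_totals(kept).items() if residual(k, t)}


rows = [("a1", 60), ("b1", 500), ("a1", 50), ("a2", 30), ("b2", 7)]
assert original(rows) == pushed_down(rows)
```

In this toy case, the relation between the two folds is easy to state: at every step, the state of the pushed-down fold equals the state of the original fold restricted to keys that start with "a". That relation plays the role the abstract assigns to a bisimulation invariant, although here it is checked only by example rather than verified.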

Problem

Research questions and friction points this paper is trying to address.

predicate pushdown

user-defined functions

fold-based computations

data pipelines

semantic reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

predicate pushdown

bisimulation invariant

optimal decomposition