🤖 AI Summary
To address intermediate result explosion in multi-way joins—especially cyclic ones—this paper introduces SplitJoin, a novel framework that elevates *split* to a first-class query operator. It employs threshold-based dynamic data partitioning to divide each relation into “heavy” and “light” parts, then tailors join orders and execution plans per partition. This breaks the conventional single-plan paradigm, enabling data-distribution-aware adaptive optimization. Technically, SplitJoin combines heavy-light partitioning, split-aware join ordering, and system-level integration as a front-end to both DuckDB and Umbra. Experiments show that, on DuckDB, SplitJoin successfully executes 43 queries (vs. 29 for the baseline), achieving a 2.1× average speedup and reducing intermediate results by 7.9×. On Umbra, it executes 45 queries (vs. 35 for the baseline), delivering a 1.3× average speedup and a 1.2× reduction in intermediate results. The work thus advances join optimization by unifying logical partitioning, plan specialization, and practical database engine integration.
📝 Abstract
Minimizing intermediate results is critical for efficient multi-join query processing. Although the seminal Yannakakis algorithm offers strong guarantees for acyclic queries, cyclic queries remain an open challenge. In this paper, we propose SplitJoin, a framework that introduces split as a first-class query operator. By partitioning input tables into heavy and light parts, SplitJoin allows different data partitions to use distinct query plans, with the goal of reducing intermediate sizes using existing binary join engines. We systematically explore the design space for split-based optimizations, including threshold selection, split strategies, and join ordering after splits. Implemented as a front-end to DuckDB and Umbra, SplitJoin achieves substantial improvements: on DuckDB, SplitJoin completes 43 social network queries (vs. 29 natively), achieving 2.1× faster runtime and 7.9× smaller intermediates on average (up to 13.6× and 74×, respectively); on Umbra, it completes 45 queries (vs. 35), achieving a 1.3× speedup and 1.2× smaller intermediates on average (up to 6.1× and 2.1×, respectively).
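To make the heavy/light idea concrete, here is a minimal, illustrative sketch (not the authors' implementation) of threshold-based splitting followed by per-partition plans. The relation layout, the `split_heavy_light` and `hash_join` helpers, and the threshold value are all assumptions for illustration; the paper's actual operator runs inside DuckDB/Umbra plans, not over Python lists.

```python
# Illustrative sketch of a SplitJoin-style heavy/light split (assumed API, not
# the paper's code). A relation is modeled as a list of (key, payload) tuples;
# "heavy" tuples are those whose join key occurs more often than a threshold.
from collections import Counter, defaultdict

def split_heavy_light(relation, key_of, threshold):
    """Partition tuples into heavy (frequent join keys) and light parts."""
    freq = Counter(key_of(t) for t in relation)
    heavy = [t for t in relation if freq[key_of(t)] > threshold]
    light = [t for t in relation if freq[key_of(t)] <= threshold]
    return heavy, light

def hash_join(r, s, rkey, skey):
    """Plain binary hash join; stands in for the underlying engine."""
    index = defaultdict(list)
    for t in s:
        index[skey(t)].append(t)
    return [(t, u) for t in r for u in index[rkey(t)]]

# Usage: split R on its join key, run a separate plan per partition, and
# union the results. In SplitJoin the two partitions could get different
# join orders; here both use the same join for brevity.
R = [(1, "a"), (1, "b"), (1, "c"), (2, "d")]
S = [(1, "x"), (2, "y")]
heavy, light = split_heavy_light(R, lambda t: t[0], threshold=2)
result = (hash_join(heavy, S, lambda t: t[0], lambda t: t[0])
          + hash_join(light, S, lambda t: t[0], lambda t: t[0]))
```

The union of the two per-partition results equals the result of joining the unsplit relation; the benefit in the real system comes from choosing a cheaper plan for each partition, shrinking intermediates.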