🤖 AI Summary
To address intermediate result explosion in multi-way joins—especially cyclic ones—this paper introduces SplitJoin, a novel framework that elevates *split* to a first-class query operator. It employs threshold-based dynamic data partitioning to divide each relation into “heavy” and “light” parts, then tailors join orders and execution plans per partition. This breaks the conventional single-plan paradigm, enabling data-distribution-aware adaptive optimization. Technically, SplitJoin combines heavy-light partitioning, split-aware join ordering, and system-level integration as a front-end to both DuckDB and Umbra. Experiments show that, on DuckDB, SplitJoin successfully executes 43 queries (vs. 29 for the baseline), achieving a 2.1× average speedup and reducing intermediate results by 7.9×. On Umbra, it executes 45 queries (vs. 35 for the baseline), delivering a 1.3× average speedup and a 1.2× reduction in intermediate results. The work thus advances join optimization by unifying logical partitioning, plan specialization, and practical database engine integration.
📝 Abstract
Minimizing intermediate results is critical for efficient multi-join query processing. Although the seminal Yannakakis algorithm offers strong guarantees for acyclic queries, cyclic queries remain an open challenge. In this paper, we propose SplitJoin, a framework that introduces split as a first-class query operator. By partitioning input tables into heavy and light parts, SplitJoin allows different data partitions to use distinct query plans, with the goal of reducing intermediate sizes using existing binary join engines. We systematically explore the design space for split-based optimizations, including threshold selection, split strategies, and join ordering after splits. Implemented as a front-end to DuckDB and Umbra, SplitJoin achieves substantial improvements: on DuckDB, SplitJoin completes 43 social network queries (vs. 29 natively), achieving 2.1× faster runtime and 7.9× smaller intermediates on average (up to 13.6× and 74×, respectively); on Umbra, it completes 45 queries (vs. 35), achieving a 1.3× speedup and 1.2× smaller intermediates on average (up to 6.1× and 2.1×, respectively).
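To make the heavy/light idea concrete, here is a minimal, illustrative sketch (not the authors' implementation) of threshold-based splitting followed by per-partition plans. The relation layout, the `split_heavy_light` and `hash_join` helpers, and the threshold value are all assumptions for illustration; the paper's actual operator runs inside DuckDB/Umbra plans, not over Python lists.

```python
# Illustrative sketch of a SplitJoin-style heavy/light split (assumed API, not
# the paper's code). A relation is modeled as a list of (key, payload) tuples;
# "heavy" tuples are those whose join key occurs more often than a threshold.
from collections import Counter, defaultdict

def split_heavy_light(relation, key_of, threshold):
    """Partition tuples into heavy (frequent join keys) and light parts."""
    freq = Counter(key_of(t) for t in relation)
    heavy = [t for t in relation if freq[key_of(t)] > threshold]
    light = [t for t in relation if freq[key_of(t)] <= threshold]
    return heavy, light

def hash_join(r, s, rkey, skey):
    """Plain binary hash join; stands in for the underlying engine."""
    index = defaultdict(list)
    for t in s:
        index[skey(t)].append(t)
    return [(t, u) for t in r for u in index[rkey(t)]]

# Usage: split R on its join key, run a separate plan per partition, and
# union the results. In SplitJoin the two partitions could get different
# join orders; here both use the same join for brevity.
R = [(1, "a"), (1, "b"), (1, "c"), (2, "d")]
S = [(1, "x"), (2, "y")]
heavy, light = split_heavy_light(R, lambda t: t[0], threshold=2)
result = (hash_join(heavy, S, lambda t: t[0], lambda t: t[0])
          + hash_join(light, S, lambda t: t[0], lambda t: t[0]))
```

The union of the two per-partition results equals the result of joining the unsplit relation; the benefit in the real system comes from choosing a cheaper plan for each partition, shrinking intermediates.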