Towards Efficient Random-Order Enumeration for Join Queries

📅 2025-07-01
🤖 AI Summary
This work addresses uniform random-order enumeration of join query results, a setting in which existing methods lack worst-case runtime guarantees, especially for cyclic joins. The proposed algorithm provides such a guarantee with no large hidden constants in its complexity. After O(|Q| log |Q|)-time index construction, with no query-specific preprocessing, it achieves expected delay O(AGM(Q)/|Res(Q)| · log²|Q|) and total running time O(AGM(Q) log |Q|), where |Q| is the input size, AGM(Q) is the AGM bound, and |Res(Q)| is the output size. The algorithm is near-optimal in the worst case under the combinatorial k-clique hypothesis and adapts to many common database indexes with only minor modifications. Two additional speed-up techniques are proposed, and experiments show that the enhanced algorithm significantly outperforms existing state-of-the-art methods.

📝 Abstract
In many data analysis pipelines, a basic and time-consuming process is to produce join results and feed them into downstream tasks. Numerous enumeration algorithms have been developed for this purpose. To be a statistically meaningful representation of the whole join result, the result tuples are required to be enumerated in uniformly random order. However, existing studies lack an efficient random-order enumeration algorithm with a worst-case runtime guarantee for (cyclic) join queries. In this paper, we study the problem of enumerating the results of a join query in random order. We develop an efficient random-order enumeration algorithm for join queries with no large hidden constants in its complexity, achieving expected $O(\frac{\mathrm{AGM}(Q)}{|Res(Q)|}\log^2|Q|)$ delay and $O(\mathrm{AGM}(Q)\log|Q|)$ total running time after $O(|Q|\log|Q|)$-time index construction, where $|Q|$ is the size of the input, $\mathrm{AGM}(Q)$ is the AGM bound, and $|Res(Q)|$ is the size of the join result. We prove that our algorithm is near-optimal in the worst case, under the combinatorial $k$-clique hypothesis. Our algorithm requires no query-specific preprocessing and can be flexibly adapted to many common database indexes with only minor modifications. We also devise two non-trivial techniques to speed up the enumeration, and provide an experimental study on our enumeration algorithm along with the speed-up techniques. The experimental results show that our algorithm, enhanced with the proposed techniques, significantly outperforms existing state-of-the-art methods.
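To make the quantities in the abstract concrete, the following is a minimal Python sketch of a naive baseline for the triangle query R(a,b) ⋈ S(b,c) ⋈ T(a,c). It is not the paper's algorithm: it materializes the full join and then emits tuples with a lazy Fisher-Yates shuffle, so its first result arrives only after the whole join has been computed, which is exactly the cost a bounded-delay enumerator aims to avoid. The function names (`triangle_join`, `random_order`, `agm_upper_bound_triangle`) and the toy relations are hypothetical. The bound computed here uses the fractional edge cover (1/2, 1/2, 1/2), which gives sqrt(|R|·|S|·|T|) as an upper bound on the triangle join size; the AGM bound is the minimum over all fractional edge covers and so is at most this value.

```python
import math
import random
from collections import defaultdict


def triangle_join(R, S, T):
    """Materialize R(a,b) JOIN S(b,c) JOIN T(a,c) using simple hash indexes."""
    s_by_b = defaultdict(list)
    for b, c in S:
        s_by_b[b].append(c)
    t_set = set(T)
    return [(a, b, c) for a, b in R for c in s_by_b[b] if (a, c) in t_set]


def random_order(results, rng):
    """Yield each tuple exactly once, in uniformly random order.

    Lazy Fisher-Yates: position i is swapped with a uniformly chosen
    position in [i, n), then emitted, so every permutation is equally likely.
    """
    pool = list(results)
    for i in range(len(pool)):
        j = rng.randrange(i, len(pool))
        pool[i], pool[j] = pool[j], pool[i]
        yield pool[i]


def agm_upper_bound_triangle(R, S, T):
    """Upper bound from the fractional edge cover (1/2, 1/2, 1/2)."""
    return math.sqrt(len(R) * len(S) * len(T))


if __name__ == "__main__":
    # Toy relations (assumed duplicate-free), chosen only for illustration.
    R = [(1, 2), (1, 3), (2, 3)]
    S = [(2, 3), (3, 1), (3, 4)]
    T = [(1, 3), (1, 1), (2, 4)]

    results = triangle_join(R, S, T)
    bound = agm_upper_bound_triangle(R, S, T)
    print(f"|Res(Q)| = {len(results)}, bound = {bound:.2f}")
    for row in random_order(results, random.Random()):
        print(row)
```

On this toy input the join has 3 result tuples and the bound is sqrt(27) ≈ 5.2, so the ratio AGM(Q)/|Res(Q)| that appears in the delay bound is small. The ratio grows when the join is sparse relative to its worst-case size, which is when per-tuple delay guarantees matter most.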
Problem

Research questions and friction points this paper is trying to address.

Efficient random-order enumeration for join queries
Worst-case runtime guarantee for cyclic joins
Near-optimal algorithm without query-specific preprocessing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient random-order enumeration algorithm
Near-optimal worst-case performance guarantee
Flexible adaptation to common database indexes
Authors
Pengyu Chen
Harbin Institute of Technology, Harbin, China
Zizheng Guo
Harbin Institute of Technology, Harbin, China
Jianwei Yang
Research Scientist, Meta SuperIntelligence Lab
Multimodal Agentic AI
Dongjing Miao
Harbin Institute of Technology, Harbin, China