🤖 AI Summary
This work addresses Poisson sampling over acyclic join query results: each join tuple is included in the sample independently, according to a non-uniform, tuple-specific probability. The authors propose a nearly instance-optimal algorithm that runs in $O(N + k \log N)$ time, where $N$ is the size of the input database and $k$ is the sample size. The approach combines a random-access index, which retrieves the $i$-th join tuple without materializing the full join result, with a probing step that uses the index to build the sample. Experiments on real-world data show that the best-performing implementation, engineered for column stores, significantly outperforms the repeated-Bernoulli-trial algorithm, and that the index alone can implement Yannakakis' acyclic join algorithm competitively when no sampling is needed.
📝 Abstract
We introduce the problem of Poisson sampling over joins: compute a sample of the result of a join query by conceptually performing a Bernoulli trial for each join tuple, using a non-uniform and tuple-specific probability. We propose an algorithm for Poisson sampling over acyclic joins that is nearly instance-optimal, running in time $O(N + k \log N)$, where $N$ is the size of the input database and $k$ is the size of the resulting sample. Our algorithm hinges on two building blocks: (1) the construction of a random-access index that, given a number $i$, retrieves the $i$-th join tuple without fully materializing the (possibly large) join result; (2) the probing of this index to construct the result sample. We study the engineering trade-offs required to make both components practical, focusing on their implementation in column stores, and identify the best-performing alternative for each. Our experiments on real-world data demonstrate that this pair of alternatives significantly outperforms the repeated-Bernoulli-trial algorithm for Poisson sampling. They also show that the random-access index by itself can be used to competitively implement Yannakakis' acyclic join processing algorithm when no sampling is required. This shows that, as far as query engine design is concerned, it is possible to adopt a uniform basis for both classical acyclic join processing and Poisson sampling, without regret relative to classical join and sampling algorithms.
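To make the two building blocks concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the simplest acyclic join, a two-relation join R(a, b) ⋈ S(b, c), and all names (`build_index`, `access`, `bernoulli_sample`, `skip_indices`) are hypothetical. The index stores only prefix sums of per-row match counts, so it is built in time linear in the input and answers an access by binary search. The baseline performs one Bernoulli trial per join tuple with a tuple-specific probability. The skipping variant covers only the simplified case of a single uniform probability `p`, and the abstract does not say how the paper handles non-uniform probabilities.

```python
import bisect
import math
import random
from collections import defaultdict
from itertools import accumulate


def build_index(R, S):
    """Index the join of R(a, b) and S(b, c) on b without materializing it.

    Returns (s_by_b, ends, total), where ends[j] is the number of join
    tuples produced by rows R[0..j] and total is the join size.
    """
    s_by_b = defaultdict(list)
    for b, c in S:
        s_by_b[b].append(c)
    ends = list(accumulate(len(s_by_b.get(b, ())) for _, b in R))
    total = ends[-1] if ends else 0
    return s_by_b, ends, total


def access(R, s_by_b, ends, i):
    """Return the i-th (0-based) join tuple using one binary search."""
    j = bisect.bisect_right(ends, i)  # first row of R whose prefix sum exceeds i
    start = ends[j - 1] if j > 0 else 0
    a, b = R[j]
    return (a, b, s_by_b[b][i - start])


def bernoulli_sample(total, prob):
    """Baseline: one Bernoulli trial per position; prob(i) is tuple-specific."""
    return [i for i in range(total) if random.random() < prob(i)]


def skip_indices(total, p):
    """Uniform-probability simplification: jump directly to the next success.

    The gap between consecutive successes is geometrically distributed,
    so the work is proportional to the sample size rather than to total.
    """
    if p <= 0.0:
        return
    if p >= 1.0:
        yield from range(total)
        return
    i = -1
    while True:
        u = 1.0 - random.random()  # in (0, 1], so log(u) is defined
        i += int(math.log(u) / math.log(1.0 - p)) + 1
        if i >= total:
            return
        yield i
```

A sample is then obtained by mapping the chosen positions through the index, for example `[access(R, s_by_b, ends, i) for i in skip_indices(total, 0.01)]`. The baseline touches every join position, whereas the skipping variant performs one index probe per sampled tuple, which is the kind of gap the $k \log N$ term in the stated bound reflects.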