🤖 AI Summary
This work addresses the challenge of performing analytical join queries over unstructured data, where pairwise execution of machine learning models incurs prohibitive computational costs and existing approaches struggle to balance efficiency with statistical reliability. To overcome this, the paper proposes Blocking-aware Sampling (BaS), a novel method that partitions the cross-product space into high-value and low-value regions, applying embedding-based blocking and statistical sampling strategies respectively. BaS further introduces a two-stage adaptive budget allocation algorithm to minimize estimation error. By integrating embedding representations, blocking techniques, and approximate query processing, BaS achieves substantial efficiency gains on multimodal real-world datasets while providing valid confidence intervals. Experimental results demonstrate that BaS reduces estimation error by up to 19× compared to state-of-the-art baselines.
📝 Abstract
Analytical join queries over unstructured data are increasingly prevalent in data analytics. Applying machine learning (ML) models to label every pair in the cross product of tables can achieve state-of-the-art accuracy, but the cost of pairwise execution of ML models is prohibitive. Existing algorithms, such as embedding-based blocking and sampling, aim to reduce this cost. However, they either fail to provide statistical guarantees (leading to errors up to 79% higher than expected) or become as inefficient as uniform sampling.
We propose blocking-augmented sampling (BaS), which simultaneously achieves statistical guarantees and high efficiency. BaS optimally orchestrates embedding-based blocking and sampling to mitigate their respective limitations. Specifically, BaS allocates data tuples in the cross product into two regimes based on the failure modes of embeddings. In the regime of false negatives, BaS uses sampling to estimate the result. In the regime of false positives, BaS applies embedding-based blocking to improve efficiency. To minimize the estimation error given a budget for ML executions, we design a novel two-stage algorithm that adaptively allocates the budget between blocking and sampling. Theoretically, we prove that BaS asymptotically outperforms or matches standalone sampling. On real-world datasets across different modalities, we show that BaS provides valid confidence intervals and reduces estimation errors by up to 19$\times$, compared to state-of-the-art baselines.