🤖 AI Summary
The original PAC-DB privacy mechanism answers a query by executing it 128 times against different random 50% sub-samples of the database, which makes it slow and hard to deploy. This work treats each bit of a hashed primary key as a membership flag for one sub-sample, so the same privatized answer can be computed within a single query. New approximate algorithms for stochastic aggregates over these hashes are SIMD-friendly and run up to 40× faster than scalar equivalents. An open-source DuckDB community extension includes a rewriter that PAC-privatizes arbitrary SQL queries. Experiments on TPC-H, ClickBench, and SQLStorm evaluate thousands of queries for both performance and utility.
📝 Abstract
This work presents a highly optimized implementation of PAC-DB, a recent and promising database privacy model. We prove that our SIMD-PAC-DB can compute the same privatized answer with just a single query, instead of the 128 stochastic executions against different 50% database sub-samples needed by the original PAC-DB. Our key insight is that every bit of a hashed primary key can be read as indicating membership in one such sub-sample. We present new algorithms for approximate computation of stochastic aggregates based on these hashes, which, thanks to their SIMD-friendliness, run up to 40x faster than scalar equivalents. We release an open-source DuckDB community extension which includes a rewriter that PAC-privatizes arbitrary SQL queries. Our experiments on TPC-H, ClickBench, and SQLStorm evaluate thousands of queries in terms of performance and utility. Together, these contributions significantly advance the ease of use and functionality of privacy-aware data systems in practice.
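The hash-bit idea can be illustrated with a minimal scalar sketch. It is not the paper's implementation: the choice of BLAKE2b as the hash, the function names, and the toy data are all assumptions made for illustration, and the sketch covers neither the SIMD kernels, the approximate aggregates, nor the PAC noise step. It only shows that one pass over the rows can produce the same 128 per-sub-sample sums as 128 separately filtered scans, when bit j of the hashed key decides membership in sub-sample j.

```python
import hashlib

NUM_SUBSAMPLES = 128  # one sub-sample per hash bit


def membership_bits(primary_key: int) -> int:
    """Return a 128-bit integer; bit j set means the row is in sub-sample j.

    BLAKE2b is an assumption here; any hash with well-mixed bits would do.
    """
    digest = hashlib.blake2b(str(primary_key).encode(), digest_size=16).digest()
    return int.from_bytes(digest, "little")


def subsample_sums(rows):
    """Single pass: accumulate SUM(value) for all 128 sub-samples at once."""
    sums = [0] * NUM_SUBSAMPLES
    for pk, value in rows:
        bits = membership_bits(pk)
        for j in range(NUM_SUBSAMPLES):
            if (bits >> j) & 1:
                sums[j] += value
    return sums


def subsample_sums_naive(rows):
    """Reference: one separate filtered scan per sub-sample (128 scans)."""
    return [
        sum(value for pk, value in rows if (membership_bits(pk) >> j) & 1)
        for j in range(NUM_SUBSAMPLES)
    ]


# Hypothetical toy table of (primary_key, value) rows.
rows = [(pk, pk % 7 + 1) for pk in range(1000)]
sums = subsample_sums(rows)
```

Because each hash bit is set for roughly half of the keys, each entry of `sums` is the aggregate over an approximately 50% sub-sample. The inner loop over the 128 bits is the part that a vectorized implementation would process in parallel lanes.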