🤖 AI Summary
Real-world SQL workloads are difficult to obtain due to privacy constraints, and the anonymized performance traces released so far omit the original queries and underlying data, so workloads synthesized from them have low fidelity. To address this, the work proposes ResQ, a system that generates high-fidelity, executable SQL workloads using only publicly available performance traces. ResQ combines execution-aware query graph modeling, Bayesian optimization-driven predicate search, search-space bounding with lightweight local cost models, and query reuse at both the exact-query and parameterized-template levels. It is designed to match the per-query execution targets and operator distributions of production traces. Evaluated on the public cloud traces Snowset and Redset and the newly released industrial trace Bendset, ResQ significantly outperforms state-of-the-art baselines: it cuts token usage by 96.71% and runtime by 86.97%, lowers maximum Q-error by 14.8× on CPU time and 997.7× on scanned bytes, and closely matches the target operator composition.
📝 Abstract
Database research and development rely heavily on realistic user workloads for benchmarking, instance optimization, migration testing, and database tuning. However, acquiring real-world SQL queries is notoriously challenging due to strict privacy regulations. While cloud database vendors have begun releasing anonymized performance traces to the research community, these traces typically provide only high-level execution statistics without the original query text or data, which is insufficient for scenarios that require actual execution. Existing tools fail to capture fine-grained performance patterns or generate runnable workloads that reproduce these public traces with both high fidelity and efficiency. To bridge this gap, we propose ResQ, a fine-grained workload synthesis system designed to generate executable SQL workloads that faithfully match the per-query execution targets and operator distributions of production traces. ResQ constructs execution-aware query graphs, instantiates them into SQL via Bayesian Optimization-driven predicate search, and explicitly models workload repetition through reuse at both exact-query and parameterized-template levels. To ensure practical scalability, ResQ combines search-space bounding with lightweight local cost models to accelerate optimization. Experiments on public cloud traces (Snowset, Redset) and a newly released industrial trace (Bendset) demonstrate that ResQ significantly outperforms state-of-the-art baselines, achieving 96.71% token savings and an 86.97% reduction in runtime, while lowering maximum Q-error by 14.8x on CPU time and 997.7x on scanned bytes, and closely matching operator composition.
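The abstract reports fidelity as maximum Q-error but does not define the metric. A minimal sketch follows, assuming the definition commonly used in the database literature: the larger of the two ratios between a measured value and its target, so a perfect match scores 1. The function names and sample CPU times are hypothetical and chosen only for illustration; the paper's exact formulation may differ.

```python
def q_error(actual: float, target: float, eps: float = 1e-9) -> float:
    """Multiplicative gap between two non-negative values (always >= 1)."""
    # Clamp to a small epsilon so a zero measurement does not divide by zero.
    a = max(actual, eps)
    t = max(target, eps)
    return max(a / t, t / a)


def max_q_error(actuals, targets):
    """Worst-case Q-error over a workload of per-query metrics."""
    return max(q_error(a, t) for a, t in zip(actuals, targets))


# Hypothetical per-query CPU times in seconds (not taken from the paper):
# targets from a trace vs. values measured on synthesized queries.
target_cpu = [0.5, 2.0, 10.0]
synth_cpu = [0.4, 2.5, 10.0]
worst = max_q_error(synth_cpu, target_cpu)
print(f"max Q-error on CPU time: {worst:.2f}")
```

Because the metric is symmetric, overshooting and undershooting a target by the same factor are penalized equally, and taking the maximum over queries emphasizes the worst-matched query instead of the average.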