AI Summary
This work addresses the challenge of maintaining efficient query pruning in dynamic cloud environments, where traditional auto-clustering approaches struggle to adapt to continuous data ingestion and evolving query workloads. The authors propose a novel method that decouples reclustering strategy from clustering key selection and introduces, for the first time, the concept of “boundary micro-partitions.” By leveraging workload-aware analysis, the approach identifies micro-partitions most critical to pruning efficiency and applies incremental reclustering only to those. Built upon metadata mechanisms involving micro-partitions and zonemaps, the proposed WAIR algorithm achieves query performance close to that of fully sorted layouts while substantially reducing reclustering overhead, with a provable theoretical upper bound on cost. Experimental results demonstrate that WAIR consistently outperforms existing solutions across TPC-H, DSB, and real-world workloads.
Abstract
Modern cloud data warehouses store data in micro-partitions and rely on metadata (e.g., zonemaps) for efficient data pruning during query processing. Maintaining data clustering in a large-scale table is crucial for effective data pruning. Existing automatic clustering approaches lack the flexibility required in dynamic cloud environments with continuous data ingestion and evolving workloads. This paper advocates a clean separation between reclustering policy and clustering-key selection. We introduce the concept of boundary micro-partitions that sit on the boundary of query ranges. We then present WAIR, a workload-aware algorithm to identify and recluster only the boundary micro-partitions most critical for pruning efficiency. WAIR achieves near-optimal query performance (relative to fully sorted table layouts) while incurring significantly lower reclustering cost, which has a theoretical upper bound. We further implement the algorithm in a prototype reclustering service and evaluate it on standard benchmarks (TPC-H, DSB) and a real-world workload. Results show that WAIR improves query performance and reduces the overall cost compared to existing solutions.
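To make the zonemap and boundary micro-partition ideas concrete, the following is a minimal illustrative sketch and not the WAIR algorithm itself. It assumes a zonemap is a (min, max) pair per micro-partition on a single integer clustering key, and it assumes a micro-partition counts as "boundary" for a range query when it overlaps the query range without being fully contained in it. The names MicroPartition, classify, and boundary_counts are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MicroPartition:
    """Hypothetical micro-partition described only by its zonemap."""
    pid: int
    min_key: int  # smallest clustering-key value stored in the partition
    max_key: int  # largest clustering-key value stored in the partition


def classify(part: MicroPartition, lo: int, hi: int) -> str:
    """Classify a partition against the inclusive query range [lo, hi]."""
    # Zonemap pruning: no overlap means the partition can be skipped.
    if part.max_key < lo or part.min_key > hi:
        return "pruned"
    # Every possible key in the partition falls inside the query range.
    if lo <= part.min_key and part.max_key <= hi:
        return "inside"
    # Assumed definition: overlaps the range but straddles one of its edges.
    return "boundary"


def boundary_counts(
    parts: List[MicroPartition], workload: List[Tuple[int, int]]
) -> Dict[int, int]:
    """Count, per partition, how many queries it is a boundary partition for."""
    counts = {p.pid: 0 for p in parts}
    for lo, hi in workload:
        for p in parts:
            if classify(p, lo, hi) == "boundary":
                counts[p.pid] += 1
    return counts


parts = [
    MicroPartition(0, 0, 9),
    MicroPartition(1, 10, 19),
    MicroPartition(2, 5, 40),   # wide key range: poorly clustered
    MicroPartition(3, 41, 60),
]
workload = [(12, 18), (10, 30), (45, 50)]
counts = boundary_counts(parts, workload)
```

In this toy workload, the partition with the widest key range (pid 2) is a boundary partition for the most queries, which is the kind of signal a workload-aware policy could use to rank reclustering candidates. How WAIR actually scores and selects partitions is not specified in the abstract.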