Workload-Aware Incremental Reclustering in Cloud Data Warehouses

📅 2026-02-26

🤖 AI Summary

This work addresses the challenge of maintaining efficient query pruning in dynamic cloud environments, where existing automatic clustering approaches struggle to adapt to continuous data ingestion and evolving query workloads. The authors separate reclustering policy from clustering-key selection and introduce the concept of “boundary micro-partitions”: micro-partitions that sit on the boundary of query ranges. Using workload-aware analysis, the WAIR algorithm identifies the boundary micro-partitions most critical to pruning efficiency and incrementally reclusters only those. Relying on micro-partition metadata such as zonemaps, WAIR achieves query performance close to that of fully sorted table layouts at significantly lower reclustering cost, and that cost has a theoretical upper bound. In experiments on TPC-H, DSB, and a real-world workload, WAIR improves query performance and reduces overall cost compared to existing solutions.
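To make the role of zonemaps concrete, here is a minimal sketch of min/max pruning on a single integer clustering key. The `ZoneMap` class, the `classify` function, and the three labels are hypothetical names chosen for illustration. Treating a "boundary" micro-partition as one whose key range straddles an endpoint of the query range is one plausible reading of the summary, not the paper's formal definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZoneMap:
    """Min/max metadata for one micro-partition (hypothetical structure)."""
    partition_id: int
    min_key: int
    max_key: int


def classify(zm: ZoneMap, lo: int, hi: int) -> str:
    """Classify a micro-partition against the closed query range [lo, hi]."""
    if zm.max_key < lo or zm.min_key > hi:
        return "pruned"    # no overlap: skipped using metadata alone
    if lo <= zm.min_key and zm.max_key <= hi:
        return "interior"  # key range lies entirely inside the query range
    return "boundary"      # straddles lo or hi: scanned, but partly wasted


# Illustrative zonemaps (made-up values) and a query on keys 10..40.
zonemaps = [
    ZoneMap(0, 0, 9),
    ZoneMap(1, 8, 25),
    ZoneMap(2, 20, 30),
    ZoneMap(3, 28, 60),
    ZoneMap(4, 70, 90),
]
labels = {zm.partition_id: classify(zm, 10, 40) for zm in zonemaps}
print(labels)
```

In this toy layout, partitions 1 and 3 must be scanned even though part of their key range falls outside the query. Under the reading above, these are the partitions a reclustering policy would examine first.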

📝 Abstract

Modern cloud data warehouses store data in micro-partitions and rely on metadata (e.g., zonemaps) for efficient data pruning during query processing. Maintaining data clustering in a large-scale table is crucial for effective data pruning. Existing automatic clustering approaches lack the flexibility required in dynamic cloud environments with continuous data ingestion and evolving workloads. This paper advocates a clean separation between reclustering policy and clustering-key selection. We introduce the concept of boundary micro-partitions that sit on the boundary of query ranges. We then present WAIR, a workload-aware algorithm to identify and recluster only the boundary micro-partitions most critical for pruning efficiency. WAIR achieves near-optimal (with respect to fully sorted table layouts) query performance but incurs significantly lower reclustering cost with a theoretical upper bound. We further implement the algorithm in a prototype reclustering service and evaluate it on standard benchmarks (TPC-H, DSB) and a real-world workload. Results show that WAIR improves query performance and reduces the overall cost compared to existing solutions.
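The idea of reclustering only a few micro-partitions can be illustrated with a toy example. This is a simplified sketch, not the WAIR algorithm: it selects every partition that straddles the range of a single query, whereas the paper describes a workload-aware choice of the most critical ones. The helper names (`overlaps`, `is_boundary`, `recluster`) and the data are assumptions made for illustration.

```python
def overlaps(rows, lo, hi):
    """True if the partition's [min, max] zonemap intersects [lo, hi]."""
    return not (max(rows) < lo or min(rows) > hi)


def is_boundary(rows, lo, hi):
    """True if the partition overlaps [lo, hi] but is not contained in it."""
    return overlaps(rows, lo, hi) and not (lo <= min(rows) and max(rows) <= hi)


def recluster(partitions, selected):
    """Sort the rows of the selected partitions together and write them
    back into the same slots; all other partitions are left untouched."""
    rows = sorted(r for i in selected for r in partitions[i])
    result = list(partitions)
    start = 0
    for i in selected:
        size = len(partitions[i])
        result[i] = rows[start:start + size]
        start += size
    return result


# Four micro-partitions of clustering-key values; the middle two interleave.
partitions = [[1, 2, 3, 4], [5, 30, 7, 28], [6, 29, 8, 31], [40, 41, 42, 43]]
lo, hi = 5, 8  # query range on the clustering key

before = sum(overlaps(p, lo, hi) for p in partitions)
selected = [i for i, p in enumerate(partitions) if is_boundary(p, lo, hi)]
reclustered = recluster(partitions, selected)
after = sum(overlaps(p, lo, hi) for p in reclustered)
print(before, after)  # partitions scanned before and after reclustering
```

Only the two interleaved partitions are rewritten, yet the query then scans one partition instead of two. Rewriting a small subset in this way is what keeps reclustering cost below that of a full sort.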

Problem

Research questions and friction points this paper is trying to address.

cloud data warehouses

data clustering

workload evolution

data pruning

micro-partitions

Innovation

Methods, ideas, or system contributions that make the work stand out.

workload-aware

incremental reclustering

boundary micro-partitions