Pruning in Snowflake: Working Smarter, Not Harder

📅 2025-04-15

🤖 AI Summary

Predicate-based partition pruning is well established, but on its own it does not cover operators such as LIMIT, top-k, and JOIN. This paper presents pruning techniques that extend partition pruning to those operators, describes how each is implemented, and examines their impact on real-world workloads. An analysis of Snowflake's production workloads shows that real-world analytical queries are far more selective than commonly assumed, which makes pruning effective and points to the need for more realistic benchmarks. The techniques rely on min/max metadata, which is available in modern data analytics systems and in data lake formats such as Apache Iceberg. The paper reports a 99.4% reduction in processed micro-partitions across the Snowflake data platform.

📝 Abstract

Modern cloud-based data analytics systems must efficiently process petabytes of data residing on cloud storage. A key optimization technique in state-of-the-art systems like Snowflake is partition pruning - skipping chunks of data that do not contain relevant information for computing query results. While partition pruning based on query predicates is a well-established technique, we present new pruning techniques that extend the scope of partition pruning to LIMIT, top-k, and JOIN operations, significantly expanding the opportunities for pruning across diverse query types. We detail the implementation of each method and examine their impact on real-world workloads. Our analysis of Snowflake's production workloads reveals that real-world analytical queries exhibit much higher selectivity than commonly assumed, yielding effective partition pruning and highlighting the need for more realistic benchmarks. We show that we can harness high selectivity by utilizing min/max metadata available in modern data analytics systems and data lake formats like Apache Iceberg, reducing the number of processed micro-partitions by 99.4% across the Snowflake data platform.

Problem

Research questions and friction points this paper is trying to address.

Extend partition pruning to LIMIT, top-k, and JOIN operations

Improve efficiency by leveraging high selectivity in analytical queries

Reduce processed micro-partitions using min/max metadata in cloud systems
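To make the min/max idea concrete, here is a minimal sketch of classic predicate-based pruning, the established baseline the abstract refers to. It is not Snowflake's implementation. The `PartitionMeta` class, the `prune_by_range` helper, and the sample values are hypothetical and only illustrate how per-partition min/max metadata lets a range predicate skip partitions without reading them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionMeta:
    """Hypothetical per-partition metadata: the min and max of one column."""
    name: str
    min_val: int
    max_val: int


def prune_by_range(partitions, lo, hi):
    """Keep only partitions whose [min_val, max_val] overlaps [lo, hi].

    A partition whose value range lies entirely outside the predicate range
    cannot contain a matching row, so it can be skipped without being read.
    """
    return [p for p in partitions if p.max_val >= lo and p.min_val <= hi]


# Illustrative metadata for three partitions of a single integer column.
partitions = [
    PartitionMeta("p0", 0, 99),
    PartitionMeta("p1", 100, 199),
    PartitionMeta("p2", 200, 299),
]

# Predicate: column BETWEEN 120 AND 150. Only p1 can hold matching rows.
survivors = prune_by_range(partitions, 120, 150)
print([p.name for p in survivors])
```

The check is conservative: a surviving partition may still hold no matching rows, but a pruned partition never does, so the query result is unchanged.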

Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends pruning to LIMIT, top-k, JOIN operations

Uses min/max metadata for high selectivity pruning

Reduces processed micro-partitions by 99.4%
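The summary does not describe how the top-k technique works, so the following is only a generic sketch of how min/max metadata can prune partitions for an `ORDER BY col DESC LIMIT k` query, under assumptions of my own. It should not be read as the paper's algorithm. The function name `top_k_with_pruning` and the tuple layout `(min_val, max_val, rows)` are hypothetical.

```python
import heapq


def top_k_with_pruning(partitions, k):
    """Return the k largest values and the number of partitions scanned.

    partitions: list of (min_val, max_val, rows) tuples, where min_val and
    max_val are the metadata bounds for the values in rows.
    """
    heap = []      # min-heap holding the k largest values seen so far
    scanned = 0
    # Visit partitions with the largest max first, so the boundary
    # (the current k-th largest value) tightens as early as possible.
    for _min_val, max_val, rows in sorted(
        partitions, key=lambda p: p[1], reverse=True
    ):
        # Once k values are held, a partition whose max does not exceed the
        # boundary cannot improve the result. Because partitions are sorted
        # by max descending, none of the remaining ones can either.
        if len(heap) == k and max_val <= heap[0]:
            break
        scanned += 1
        for value in rows:
            if len(heap) < k:
                heapq.heappush(heap, value)
            elif value > heap[0]:
                heapq.heapreplace(heap, value)
    return sorted(heap, reverse=True), scanned


# Illustrative data: four partitions, only one needs to be read for k = 2.
example = [
    (1, 10, [1, 5, 10]),
    (50, 90, [50, 70, 90]),
    (20, 60, [20, 40, 60]),
    (0, 5, [0, 3, 5]),
]
top_values, partitions_scanned = top_k_with_pruning(example, k=2)
print(top_values, partitions_scanned)
```

In this sketch the pruning decision is made at run time, because the boundary value is only known after some rows have been read. That is the main difference from the static predicate check.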
