Terabyte-Scale Analytics in the Blink of an Eye

📅 2025-06-10
🤖 AI Summary
This work addresses CPU architectural bottlenecks in TB-scale analytical workloads by exploring the theoretical and practical limits of GPU-cluster acceleration for distributed SQL query processing. We propose the first GPU-native distributed SQL execution framework tailored to OLAP workloads, integrating NCCL collective communication primitives with asynchronous RDMA-based data movement, and introducing a columnar memory layout coupled with query-level pipelined scheduling. We provide the first systematic scalability evaluation of GPU clusters under analytical workloads, establishing an upper bound of over 60× end-to-end performance improvement. On the full TPC-H 1 TB benchmark, comprising all 22 queries, we achieve sub-400 ms completion time (roughly two human eye blinks) and 60× higher throughput than an equivalently configured CPU cluster. This work delivers a foundational architectural paradigm and an empirical benchmark for GPU-native distributed analytical databases.
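The columnar memory layout mentioned above is what lets a GPU engine execute scan-heavy OLAP queries as wide vectorized passes over contiguous arrays. A minimal sketch of the idea, using NumPy as a stand-in for a GPU array library (e.g. CuPy) and a simplified, purely illustrative TPC-H `lineitem` schema with made-up values:

```python
import numpy as np

# Columnar (struct-of-arrays) layout: each column is one contiguous
# array, which maps directly onto coalesced GPU memory access.
# The column names follow TPC-H; the data is illustrative only.
lineitem = {
    "l_quantity":      np.array([17.0, 36.0, 8.0, 28.0, 24.0]),
    "l_extendedprice": np.array([1000.0, 2000.0, 1500.0, 500.0, 3000.0]),
    "l_discount":      np.array([0.06, 0.05, 0.07, 0.09, 0.06]),
}

def q6_style_revenue(cols, lo=0.05, hi=0.07, max_qty=25.0):
    """Vectorized filter + aggregate over columnar data (TPC-H Q6 shape)."""
    # Predicate evaluation touches only the columns it needs,
    # never whole rows -- the core win of the columnar layout.
    mask = (
        (cols["l_discount"] >= lo)
        & (cols["l_discount"] <= hi)
        & (cols["l_quantity"] < max_qty)
    )
    return float(np.sum(cols["l_extendedprice"][mask] * cols["l_discount"][mask]))

revenue = q6_style_revenue(lineitem)  # 60 + 105 + 180 = 345.0
```

On a GPU, the same mask-then-reduce shape becomes a handful of kernels over device-resident columns, which is why scan/aggregate queries like Q6 benefit so heavily from the layout.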

📝 Abstract
For the past two decades, the DB community has devoted substantial research to taking advantage of cheap clusters of machines for distributed data analytics -- we believe that we are at the beginning of a paradigm shift. The scaling laws and popularity of AI models lead to the deployment of incredibly powerful GPU clusters in commercial data centers. Compared to CPU-only solutions, these clusters deliver impressive improvements in per-node compute, memory bandwidth, and inter-node interconnect performance. In this paper, we study the problem of scaling analytical SQL queries on distributed clusters of GPUs, with the stated goal of establishing an upper bound on the likely performance gains. To do so, we build a prototype designed to maximize performance by leveraging ML/HPC best practices, such as group communication primitives for cross-device data movements. This allows us to conduct thorough performance experimentation to point our community towards a massive performance opportunity of at least 60×. To make these gains more relatable, before you can blink twice, our system can run all 22 queries of TPC-H at a 1TB scale factor!
Problem

Research questions and friction points this paper is trying to address.

Scaling analytical SQL queries on GPU clusters
Establishing upper bounds on performance gains
Leveraging ML/HPC practices for data movement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging GPU clusters for distributed SQL analytics
Using ML/HPC best practices for performance maximization
Implementing group communication primitives for data movement
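The "group communication primitives for data movement" bullet refers to the all-to-all exchange pattern that distributed joins and aggregations need: each worker hash-partitions its local keys and every worker receives exactly its partition from every peer. A sketch of that pattern, with plain Python lists simulating the per-worker GPU buffers that an NCCL-style collective would exchange (the modulo hash and worker count are assumptions for illustration):

```python
import numpy as np

NUM_WORKERS = 4  # illustrative cluster size

def partition_by_hash(keys, num_workers=NUM_WORKERS):
    """Split a key column into one send buffer per destination worker."""
    dest = keys % num_workers  # cheap hash: modulo on integer keys
    return [keys[dest == w] for w in range(num_workers)]

def all_to_all(send_buffers_per_worker):
    """Exchange buffers: worker w receives partition w from every peer.

    This is the data movement an all-to-all collective performs; here
    it is a local simulation rather than an actual network transfer.
    """
    n = len(send_buffers_per_worker)
    return [
        np.concatenate([send_buffers_per_worker[src][w] for src in range(n)])
        for w in range(n)
    ]

# Each worker starts with an arbitrary contiguous slice of the key column.
local_keys = [np.arange(w * 8, (w + 1) * 8) for w in range(NUM_WORKERS)]
send = [partition_by_hash(k) for k in local_keys]
recv = all_to_all(send)
# After the shuffle, every key with hash h lives on worker h, so a join
# or group-by can proceed with purely local probes.
```

On real hardware this exchange would run as grouped NCCL send/recv operations over NVLink/RDMA, overlapping with compute, which is where the ML/HPC best practices the paper borrows come in.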