🤖 AI Summary
To address insufficient computation–I/O overlap, which causes low GPU, network, and storage utilization in large-scale, storage-resident OLAP queries under GPU memory constraints, this paper introduces PystachIO, the first distributed OLAP engine built on PyTorch. Its core innovations are: (1) deep integration of the tensor computation runtime into the distributed query execution layer; (2) an RDMA-aware asynchronous I/O scheduler that enables fine-grained pipelining between NVMe storage reads and GPU kernel execution; and (3) an end-to-end co-optimization framework that jointly orchestrates computation, communication, and storage. Experiments on representative OLAP workloads show that PystachIO achieves up to 3× end-to-end speedups over state-of-the-art GPU-accelerated systems, with 2.1× higher GPU utilization and 1.8× and 2.4× better utilization of network and storage bandwidth, respectively.
📝 Abstract
The AI hardware boom has led modern data centers to adopt HPC-style architectures centered on distributed, GPU-centric computation. Large GPU clusters interconnected by fast RDMA networks and backed by high-bandwidth NVMe storage enable scalable computation and rapid access to storage-resident data. Tensor computation runtimes (TCRs), such as PyTorch, originally designed for AI workloads, have recently been shown to accelerate analytical workloads. However, prior work has primarily considered settings where the data fits in aggregated GPU memory. In this paper, we systematically study how TCRs can support scalable, distributed query processing for large-scale, storage-resident OLAP workloads. Although TCRs provide abstractions for network and storage I/O, naive use often underutilizes GPU and I/O bandwidth due to insufficient overlap between computation and data movement. As a core contribution, we present PystachIO, a PyTorch-based distributed OLAP engine that combines fast network and storage I/O with key optimizations to maximize GPU, network, and storage utilization. Our evaluation shows up to 3x end-to-end speedups over existing distributed GPU-based query processing approaches.
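The central bottleneck both the summary and the abstract name is insufficient overlap between computation and data movement. Below is a minimal, CPU-only sketch of that pipelining idea using a bounded producer–consumer buffer: while chunk *i* is being processed, chunk *i+1* is already being read. This is illustrative only; the function names (`pipeline`, `read_fn`, `compute_fn`) are hypothetical and not PystachIO's API, which per the paper would instead coordinate RDMA/NVMe reads with GPU kernel launches (e.g. via CUDA streams).

```python
import threading
import queue

def pipeline(chunks, read_fn, compute_fn, depth=2):
    """Overlap an I/O stage with a compute stage.

    While the consumer runs compute_fn on chunk i, the background
    reader thread is already running read_fn on chunk i+1, up to
    `depth` chunks ahead (depth=2 ~ classic double buffering).
    Hypothetical sketch; stands in for NVMe-read / GPU-kernel overlap.
    """
    buf = queue.Queue(maxsize=depth)  # bounded buffer caps read-ahead

    def reader():
        for c in chunks:
            buf.put(read_fn(c))       # stand-in for an NVMe/RDMA read
        buf.put(None)                 # sentinel: no more chunks

    t = threading.Thread(target=reader)
    t.start()
    results = []
    while (item := buf.get()) is not None:
        results.append(compute_fn(item))  # stand-in for a GPU kernel
    t.join()
    return results

# Example: four chunks flow through the two overlapped stages.
out = pipeline(range(4), read_fn=lambda c: c * 10, compute_fn=lambda x: x + 1)
print(out)  # → [1, 11, 21, 31]
```

The bounded queue is the key design choice: it keeps the reader far enough ahead to hide I/O latency, but caps buffering so staged data fits in limited (GPU) memory, which is the constraint the paper operates under.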