Theseus: A Distributed and Scalable GPU-Accelerated Query Processing Platform Optimized for Efficient Data Movement

📅 2025-08-07
🤖 AI Summary
To address high data movement overhead, memory fragmentation, and weak hardware-software co-design in GPU-accelerated OLAP query processing on multi-terabyte datasets, this paper presents Theseus, a distributed, accelerator-native query engine. Theseus uses specialized asynchronous control mechanisms, tightly coupled to the hardware resources, for GPU compute tasks, network communication, data pre-loading, and data spilling across memories and storage. Its memory subsystem uses fixed-size page-locked host memory allocations to increase throughput and reduce memory fragmentation. On TPC-H benchmarks run on cloud infrastructure at scale factors from 1k to 30k, Theseus outperforms Databricks Photon by up to 4× at cost parity. It can also process all TPC-H and TPC-DS queries at scale factor 100k (100 TB scale) with as few as two DGX A100 640GB nodes.

📝 Abstract
Online analytical processing of queries on datasets in the many-terabyte range is only possible with costly distributed computing systems. To decrease the cost and increase the throughput, systems can leverage accelerators such as GPUs, which are now ubiquitous in the compute infrastructure. This introduces many challenges, the majority of which are related to when, where, and how to best move data around the system. We present Theseus -- a production-ready enterprise-scale distributed accelerator-native query engine designed to balance data movement, memory utilization, and computation in an accelerator-based system context. Specialized asynchronous control mechanisms are tightly coupled to the hardware resources for the purpose of network communication, data pre-loading, data spilling across memories and storage, and GPU compute tasks. The memory subsystem contains a mechanism for fixed-size page-locked host memory allocations to increase throughput and reduce memory fragmentation. For the TPC-H benchmarks at scale factors ranging from 1k to 30k on cloud infrastructure, Theseus outperforms Databricks Photon by up to $4\times$ at cost parity. Theseus is capable of processing all queries of the TPC-H and TPC-DS benchmarks at scale factor 100k (100 TB scale) with as few as 2 DGX A100 640GB nodes.
Problem

Research questions and friction points this paper is trying to address.

Optimizing data movement in distributed GPU-accelerated query processing
Balancing memory utilization and computation in accelerator-based systems
Scaling analytical queries efficiently for terabyte-range datasets
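
The data-movement problem listed above can be illustrated with a small sketch of overlapping data pre-loading with compute. This is not Theseus code: the function names (`preload`, `compute`, `run_query`), the `prefetch_depth` parameter, and the simulated latencies are all assumptions made for illustration, and a bounded `asyncio.Queue` stands in for whatever staging buffers a real engine would use.

```python
import asyncio


async def preload(batches, queue):
    """Producer: stage batches ahead of compute (stand-in for storage or network reads)."""
    for batch in batches:
        await asyncio.sleep(0.01)  # simulated I/O latency (assumption)
        await queue.put(batch)     # waits when the queue is full, bounding prefetch depth
    await queue.put(None)          # sentinel: no more data


async def compute(queue):
    """Consumer: process staged batches (stand-in for GPU compute tasks)."""
    total = 0
    while True:
        batch = await queue.get()
        if batch is None:
            return total
        await asyncio.sleep(0.01)  # simulated compute time (assumption)
        total += sum(batch)


async def run_pipeline(batches, prefetch_depth=2):
    # The bounded queue lets loading of later batches overlap with compute
    # on earlier ones without staging an unbounded amount of data.
    queue = asyncio.Queue(maxsize=prefetch_depth)
    producer = asyncio.create_task(preload(batches, queue))
    result = await compute(queue)
    await producer
    return result


def run_query(batches, prefetch_depth=2):
    """Hypothetical entry point: sum all values across batches."""
    return asyncio.run(run_pipeline(batches, prefetch_depth))
```

Calling `run_query([[1, 2, 3], [4, 5], [6]])` sums the values while the producer and consumer run concurrently. The bounded queue is the design point: it caps how far pre-loading may run ahead, which is one simple way to trade memory use against overlap.
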
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributed accelerator-native query engine design
Specialized asynchronous control mechanisms
Fixed-size page-locked host memory allocations
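
To show why fixed-size allocations can reduce fragmentation, here is a minimal sketch of a fixed-size page pool. It is an illustration under stated assumptions, not the paper's implementation: the class `FixedPagePool` and its methods are hypothetical, and an ordinary `bytearray` stands in for page-locked host memory, which plain Python cannot allocate.

```python
class FixedPagePool:
    """Pool of equally sized pages carved out of one upfront allocation.

    Hypothetical sketch: a bytearray stands in for page-locked host memory.
    """

    def __init__(self, page_size, num_pages):
        self.page_size = page_size
        self._buffer = bytearray(page_size * num_pages)  # allocated once
        self._free = list(range(num_pages))              # indices of free pages

    def pages_needed(self, nbytes):
        # Ceiling division; every request takes at least one page.
        return max(1, -(-nbytes // self.page_size))

    def allocate(self, nbytes):
        """Return a list of page indices; the pages need not be contiguous."""
        n = self.pages_needed(nbytes)
        if n > len(self._free):
            # A real engine might spill to another memory tier here (assumption).
            raise MemoryError("pool exhausted")
        return [self._free.pop() for _ in range(n)]

    def free(self, pages):
        """Return pages to the pool; any freed page can serve any later request."""
        self._free.extend(pages)

    def view(self, page):
        """Writable view over a single page of the backing buffer."""
        start = page * self.page_size
        return memoryview(self._buffer)[start:start + self.page_size]

    @property
    def free_pages(self):
        return len(self._free)
```

Because every page has the same size, a request succeeds whenever enough pages are free, regardless of where they sit in the buffer. For example, with four 4096-byte pages, freeing a two-page allocation next to a still-live one-page allocation leaves three free pages that can together serve a 12288-byte request.
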
Authors
Felipe Aramburú
Voltron Data
William Malpica
Voltron Data
Kaouther Abrougui
Voltron Data
Amin Aramoon
Voltron Data
Romulo Auccapuclla
Voltron Data
Claude Brisson
Voltron Data
Matthijs Brobbel
Voltron Data
Colby Farrell
Voltron Data
Pradeep Garigipati
Voltron Data
Joost Hoozemans
Delft University of Technology
Computer Engineering, VLIW, embedded systems, Image Processing
Supun Kamburugamuve
Voltron Data
Akhil Nair
Voltron Data
Alexander Ocsa
Voltron Data
Johan Peltenburg
Voltron Data
Rubén Quesada López
Voltron Data
Deepak Sihag
Voltron Data
Ahmet Uyar
Remote Post-Doctoral Researcher at Digital Science Center - Indiana University, Bloomington, IN, USA
Big Data, Cloud Computing, Web Search, VoIP
Dhruv Vats
Voltron Data
Michael Wendt
Voltron Data
Jignesh M. Patel
Carnegie Mellon University
Database Systems, Data Management
Rodrigo Aramburú
Voltron Data