RadiK: Scalable and Optimized GPU-Parallel Radix Top-K Selection

📅 2024-05-30
🏛️ International Conference on Supercomputing
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Existing GPU-based Top-k algorithms are constrained by on-chip memory capacity, limiting scalability to large k values and hindering applicability in database systems and deep learning. This paper introduces the first GPU-parallel Top-k framework built upon optimized radix sort, co-designed for high memory bandwidth utilization and resource efficiency—thereby eliminating the traditional on-chip memory bottleneck on k. It supports arbitrary input lengths and batch sizes while maintaining high throughput for large-k selection. Key innovations include: (1) customized memory-access optimization to maximize DRAM bandwidth; (2) an input-aware adaptive scaling strategy that dynamically adjusts computational granularity; and (3) a batch-coordinated scheduling mechanism to balance load across SMs. Experiments demonstrate 2.5× speedup over state-of-the-art methods for single-query workloads and 4.8× for batched queries. Under adversarial data distributions, performance further improves by up to 2.7×.

Technology Category

Application Category

📝 Abstract
Top-k selection, which identifies the largest or smallest k elements from a data set, is a fundamental operation in data-intensive domains such as databases and deep learning, so its scalability and efficiency are critical for these high-performance systems. However, previous studies on its efficient GPU implementation are mostly merge-based and rely heavily on fast but size-limited on-chip memory, thereby limiting scalability with a restricted upper bound on k. This work introduces RadiK, a scalable and optimized GPU-parallel radix top-k selection that supports significantly larger k values than existing methods without compromising efficiency, regardless of input length and batch size. RadiK incorporates a novel optimization framework tailored for high memory bandwidth and resource utilization, achieving up to 2.5 × speedup over the prior art for non-batch queries and up to 4.8 × speedup for batch queries. In addition, we propose an adaptive scaling technique that strengthens robustness, which further provides up to 2.7 × speedup on highly adversarial input distributions.
Problem

Research questions and friction points this paper is trying to address.

GPU Top-k Selection
Memory Limitation
Big Data Processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Top-K Optimization
GPU Acceleration
Performance Tuning
Y
Yifei Li
Alibaba Group
B
Bole Zhou
Independent
J
Jiejing Zhang
Alibaba Group
Xuechao Wei
Xuechao Wei
HYGON
Computer Architecture
Y
Yinghan Li
Alibaba Group
Yingda Chen
Yingda Chen
Alibaba Group, Microsoft