RadiK: Scalable and Optimized GPU-Parallel Radix Top-K Selection

📅 2024-05-30

🏛️ International Conference on Supercomputing

📈 Citations: 1

✨ Influential: 0

career value

231K/year

🤖 AI Summary

Existing GPU-based Top-k algorithms are constrained by on-chip memory capacity, limiting scalability to large k values and hindering applicability in database systems and deep learning. This paper introduces the first GPU-parallel Top-k framework built upon optimized radix sort, co-designed for high memory bandwidth utilization and resource efficiency—thereby eliminating the traditional on-chip memory bottleneck on k. It supports arbitrary input lengths and batch sizes while maintaining high throughput for large-k selection. Key innovations include: (1) customized memory-access optimization to maximize DRAM bandwidth; (2) an input-aware adaptive scaling strategy that dynamically adjusts computational granularity; and (3) a batch-coordinated scheduling mechanism to balance load across SMs. Experiments demonstrate 2.5× speedup over state-of-the-art methods for single-query workloads and 4.8× for batched queries. Under adversarial data distributions, performance further improves by up to 2.7×.

Technology Category

Application Category

📝 Abstract

Top-k selection, which identifies the largest or smallest k elements from a data set, is a fundamental operation in data-intensive domains such as databases and deep learning, so its scalability and efficiency are critical for these high-performance systems. However, previous studies on its efficient GPU implementation are mostly merge-based and rely heavily on fast but size-limited on-chip memory, thereby limiting scalability with a restricted upper bound on k. This work introduces RadiK, a scalable and optimized GPU-parallel radix top-k selection that supports significantly larger k values than existing methods without compromising efficiency, regardless of input length and batch size. RadiK incorporates a novel optimization framework tailored for high memory bandwidth and resource utilization, achieving up to 2.5 × speedup over the prior art for non-batch queries and up to 4.8 × speedup for batch queries. In addition, we propose an adaptive scaling technique that strengthens robustness, which further provides up to 2.7 × speedup on highly adversarial input distributions.

Problem

Research questions and friction points this paper is trying to address.

GPU Top-k Selection

Memory Limitation

Big Data Processing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Top-K Optimization

GPU Acceleration

Performance Tuning

🔎 Similar Papers

RTop-K: Ultra-Fast Row-Wise Top-K Selection for Neural Network Acceleration on GPUs