🤖 AI Summary
This work addresses the inefficiency of row-wise Top-K selection on GPUs—a bottleneck in information retrieval, large-scale data processing, and graph neural network (GNN) training. We propose RTop-K, an ultra-fast parallel algorithm built upon a novel GPU-optimized binary-search-based framework for row-wise Top-K selection. RTop-K integrates a dynamic early-stopping mechanism with memory-hierarchy-aware scheduling, achieving substantial latency reduction without compromising numerical precision. Theoretical analysis and empirical evaluation validate the effectiveness of the early-stopping strategy. Notably, RTop-K enables, for the first time, end-to-end acceleration of MaxK-GNN training. Compared to state-of-the-art methods, RTop-K achieves up to 11.49× speedup (with early stopping enabled) and accelerates MaxK-GNN training by 11.97%–33.29%, with no loss in test classification accuracy.
📝 Abstract
Top-k selection algorithms are fundamental in a wide range of applications, including high-performance computing, information retrieval, big data processing, and neural network model training. In this paper, we present RTop-K, a highly efficient parallel row-wise top-k selection algorithm specifically designed for GPUs. RTop-K leverages a binary search-based approach to optimize row-wise top-k selection, providing a scalable and accelerated solution. We conduct a detailed analysis of early stopping in our algorithm, showing that it effectively maintains the testing accuracy of neural network models while substantially improving performance. Our GPU implementation of RTop-K demonstrates superior performance over state-of-the-art row-wise top-k GPU implementations, achieving an average speed-up of up to 11.49× with early stopping and 7.29× without early stopping. Moreover, RTop-K accelerates the overall training workflow of MaxK-GNNs, delivering speed-ups ranging from 11.97% to 33.29% across different models and datasets.
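To make the core idea concrete, here is a minimal CPU sketch of binary-search-based top-k selection for a single row. The actual RTop-K kernel is a GPU implementation with parallel counting and a dynamic early-stopping rule; this illustrative Python version (function name, `tol` parameter, and iteration cap are our own assumptions, not the paper's API) only shows how a value threshold can be binary-searched so that roughly k elements pass it, with a crude "close enough" early stop.

```python
import numpy as np

def rowwise_topk_threshold(row, k, max_iters=64, tol=0):
    # Illustrative sketch, not the paper's kernel: binary-search a
    # threshold t so that about k elements of `row` satisfy row >= t.
    lo, hi = float(row.min()), float(row.max())
    t, cnt = hi, 0
    for _ in range(max_iters):
        t = 0.5 * (lo + hi)
        cnt = int((row >= t).sum())
        if abs(cnt - k) <= tol:   # early stop once the count is within tol of k
            break
        if cnt > k:
            lo = t                # too many elements pass: raise the threshold
        else:
            hi = t                # too few elements pass: lower the threshold
    return t, cnt

# Example: top-2 of one row; elements >= t form the selected set.
row = np.array([0.1, 0.9, 0.5, 0.3, 0.7])
t, cnt = rowwise_topk_threshold(row, 2)
```

On a GPU, the per-iteration count becomes a parallel reduction, and rows are processed independently, which is what makes the row-wise formulation amenable to warp- and block-level parallelism; with `tol > 0` the loop may return slightly more or fewer than k elements, mirroring the approximate selection that the paper's early stopping trades for speed.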