Communication-Avoiding Linear Algebraic Kernel K-Means on GPUs

📅 2026-01-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scalability limitations of traditional kernel k-means clustering, which is constrained by single-GPU memory and struggles with million-scale datasets. To overcome this bottleneck, we propose a distributed kernel k-means algorithm designed for multi-GPU systems, reformulating core computations as communication-efficient distributed linear algebra primitives. Our approach introduces an innovative combination of 1.5D communication-avoiding partitioning and structured linear algebra techniques to minimize inter-GPU data movement. Evaluated on up to 256 GPUs, the method achieves 79.7% weak-scaling efficiency and a 4.2× speedup in strong scaling. Notably, clustering time is reduced from over one hour using a single-GPU sliding-window baseline to under two seconds, dramatically enhancing both scalability and performance for large-scale data clustering.
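The reformulation the summary describes rests on the fact that kernel k-means never needs explicit feature-space coordinates: point-to-centroid distances can be written entirely as products of the kernel matrix K with a one-hot membership matrix. The sketch below is a minimal single-node NumPy illustration of that linear-algebraic view (function and variable names are illustrative, not taken from the paper, and this omits the paper's distributed partitioning and sparsity):

```python
import numpy as np

def kernel_kmeans(K, k, n_iters=50, seed=0):
    """Lloyd-style kernel k-means expressed purely via products with K.

    K: (n, n) positive semi-definite kernel matrix.
    Returns integer cluster labels of shape (n,).
    """
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=n)
    diag = np.diag(K)                       # K_ii = ||phi(x_i)||^2
    for _ in range(n_iters):
        M = np.eye(k)[labels]               # (n, k) one-hot membership matrix
        counts = M.sum(axis=0)
        counts[counts == 0] = 1             # guard against empty clusters
        KM = K @ M                          # KM[i, c] = sum_{j in c} K_ij
        # ||phi(x_i) - mu_c||^2 = K_ii - 2*KM[i,c]/|c| + (M^T K M)_cc / |c|^2
        quad = np.einsum('nc,nc->c', M, KM) / counts**2
        dist = diag[:, None] - 2 * KM / counts + quad[None, :]
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

def rbf_kernel(X, gamma=1.0):
    """Gaussian (RBF) kernel matrix for rows of X."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)
```

Every expensive step here is a dense product with K (`K @ M` and the `M^T K M` diagonal), which is what lets the paper map the algorithm onto distributed linear algebra primitives rather than bespoke clustering kernels.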

📝 Abstract
Clustering is an important tool in data analysis, and K-means is popular for its simplicity and versatility. However, it cannot handle non-linearly separable clusters. Kernel K-means addresses this limitation but requires a large kernel matrix, making it computationally and memory intensive. Prior work accelerated Kernel K-means by formulating it with sparse linear algebra primitives and implementing it on a single GPU, but that approach cannot run on datasets with more than approximately 80,000 samples due to limited GPU memory. In this work, we address this issue by presenting a suite of distributed-memory parallel algorithms for large-scale Kernel K-means clustering on multi-GPU systems. Our approach maps the most computationally expensive components of Kernel K-means onto communication-efficient distributed linear algebra primitives uniquely tailored for Kernel K-means, enabling highly scalable implementations that efficiently cluster million-scale datasets. Central to our work is the design of partitioning schemes that enable communication-efficient composition of the linear algebra primitives that appear in Kernel K-means. Our 1.5D algorithm consistently achieves the highest performance, enabling Kernel K-means to scale to data one to two orders of magnitude larger than previously practical. On 256 GPUs, it achieves a geometric mean weak-scaling efficiency of 79.7% and a geometric mean strong-scaling speedup of 4.2×. Compared to our 1D algorithm, the 1.5D approach achieves up to a 3.6× speedup on 256 GPUs, and it reduces clustering time from over an hour to under two seconds relative to a single-GPU sliding-window implementation. Our results show that distributed algorithms designed with application-specific linear algebraic formulations can achieve substantial performance improvements.
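The "1.5D" family of algorithms the abstract refers to reduces communication by replicating one operand of a matrix product across groups of processors, trading memory for fewer messages. The toy single-process simulation below illustrates that replication idea for a product like K @ M on a (p/c) × c grid; it is a generic sketch of 1.5D-style partitioning under assumed block layouts, not the paper's exact scheme:

```python
import numpy as np

def simulate_1p5d_product(K, M, p=8, c=2):
    """Toy simulation of a 1.5D-style partitioned product K @ M.

    The p 'processors' form a (p // c, c) grid. K is split into p // c
    row blocks, each conceptually replicated across the c columns of its
    grid row; M's columns are split c ways, so each processor multiplies
    its K block by only 1/c of M. The c-fold replication of K is the
    memory cost paid for reduced communication of the other operand.
    (Illustrative layout only; not the paper's exact algorithm.)
    """
    row_blocks = np.array_split(K, p // c, axis=0)   # 1D row partition of K
    col_blocks = np.array_split(M, c, axis=1)        # M split across replicas
    out_rows = []
    for Kb in row_blocks:
        # each of the c replicas in a grid row computes one output slice
        out_rows.append(np.hstack([Kb @ Mb for Mb in col_blocks]))
    return np.vstack(out_rows)
```

Stacking the per-processor slices reproduces the full product exactly; in a real multi-GPU setting the gather/stack steps become collectives, and choosing c tunes the replication-vs-communication trade-off.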
Problem

Research questions and friction points this paper is trying to address.

Kernel K-means
large-scale clustering
GPU memory limitation
distributed-memory
communication efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Communication-Avoiding
Kernel K-means
Distributed Linear Algebra
Multi-GPU
1.5D Algorithm