🤖 AI Summary
This work addresses the high communication overhead and the lack of efficient parallel strategies for sketching with random dense matrices in distributed-memory environments. It establishes, for the first time, a communication lower bound for this problem, revealing that zero communication is achievable with a small number of processors, and extends this theoretical insight to the Nyström approximation setting, where sketching is applied twice. Building on these results, the authors propose novel parallel algorithms whose communication costs match or closely approach the established lower bounds. Implemented on heterogeneous CPU/GPU supercomputing platforms, the methods demonstrate strong and weak scalability, with empirical communication costs approaching the theoretical limit, confirming their efficiency and practicality.
📝 Abstract
Sketching is widely used in randomized linear algebra for low-rank matrix approximation, column subset selection, and many other problems, and it has gained significant traction in machine learning applications. However, sketching large matrices often requires distributed-memory algorithms, where communication overhead becomes a critical bottleneck on modern supercomputing clusters. Despite its growing relevance, distributed-memory parallel strategies for sketching remain largely unexplored. In this work, we establish communication lower bounds for sketching with dense random matrices, determining how much data movement is required to perform it in parallel. One important consequence of our lower bounds is that no communication is required when the number of processors is small. We show that our lower bounds are tight by presenting communication-optimal algorithms. Furthermore, we extend our approach to derive communication lower bounds for Nyström approximation, where sketching is applied twice. We also introduce novel parallel algorithms whose communication costs are close to these lower bounds. Finally, we implement our algorithms on modern state-of-the-art supercomputing infrastructures equipped with both CPUs and GPUs and demonstrate their parallel scalability.
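To make the two operations in the abstract concrete, here is a minimal single-node NumPy sketch (not the paper's distributed algorithm): compressing a matrix with a dense Gaussian random sketch, and a rank-k Nyström approximation of a PSD matrix, in which the random test matrix is applied twice. The Gaussian sketch, the matrix sizes, and the synthetic decaying spectrum are illustrative assumptions, not choices taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tall matrix with a decaying spectrum (synthetic, for illustration only).
m, n, k = 1000, 200, 20
A = rng.standard_normal((m, n)) * (0.9 ** np.arange(n))

# --- Sketching: compress A with a dense random matrix S (n x k, k << n). ---
S = rng.standard_normal((n, k)) / np.sqrt(k)
Y = A @ S                       # sketch of A: m x k

# --- Nyström approximation of a PSD matrix G: the sketch is applied twice. ---
G = A @ A.T                     # m x m positive semidefinite matrix
Omega = rng.standard_normal((m, k))
GO = G @ Omega                  # first application:  G @ Omega   (m x k)
core = Omega.T @ GO             # second application: Omega^T G Omega (k x k)
G_nys = GO @ np.linalg.pinv(core) @ GO.T   # rank-k Nyström approximation

# Relative error; small here because G's eigenvalues decay quickly.
err = np.linalg.norm(G - G_nys) / np.linalg.norm(G)
```

In a distributed-memory setting, the products `A @ S` and `G @ Omega` are exactly the steps whose data movement the paper's lower bounds and parallel algorithms address.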