Pipelined Dense Symmetric Eigenvalue Decomposition on Multi-GPU Architectures

📅 2025-11-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing distributed libraries for symmetric eigenvalue decomposition (SEVD) achieve only ∼1.5% of peak performance on multi-GPU systems, severely limiting scalability for large-scale problems. To address this, we propose a pipelined two-stage SEVD algorithm that replaces conventional sequential workflows with fine-grained parallelism. Our method employs tile-based matrix partitioning, dynamic load balancing across GPUs, overlapping of communication and computation, and optimized data distribution strategies—collectively enhancing GPU resource utilization and scalability. Experiments on an 8×A100 platform demonstrate average speedups of 5.74× over cuSOLVERMp and 6.59× over MAGMA. Both strong and weak scaling significantly outperform baseline libraries. This work breaks the performance bottleneck of current distributed SEVD solvers and establishes a new, efficient paradigm for eigenanalysis of ultra-large symmetric matrices.
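The summary's key mechanism is overlapping communication with computation so that tile transfers hide behind kernel execution. Below is a minimal, self-contained sketch of that double-buffered copy/compute overlap pattern in CUDA; it is not the paper's implementation, and `process_tile` is a hypothetical stand-in for a real two-stage reduction step (e.g., a Householder trailing update).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for one tile update of the two-stage reduction;
// the real algorithm would run Householder panel/trailing-update kernels here.
__global__ void process_tile(double *tile, int elems) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < elems) tile[i] *= 0.999;
}

int main() {
    const int ntiles = 16;
    const int elems  = 1 << 20;  // elements per tile (illustrative size)
    double *host;
    cudaMallocHost(&host, (size_t)ntiles * elems * sizeof(double)); // pinned => true async copies
    for (size_t i = 0; i < (size_t)ntiles * elems; ++i) host[i] = 1.0;

    double *buf[2];
    cudaMalloc(&buf[0], elems * sizeof(double));
    cudaMalloc(&buf[1], elems * sizeof(double));

    cudaStream_t copy_s, comp_s;
    cudaStreamCreate(&copy_s);
    cudaStreamCreate(&comp_s);
    cudaEvent_t copied[2], consumed[2];
    for (int b = 0; b < 2; ++b) { cudaEventCreate(&copied[b]); cudaEventCreate(&consumed[b]); }

    // Double buffering: while tile k runs on the compute stream, tile k+1
    // is staged on the copy stream, hiding transfer time behind compute.
    for (int k = 0; k < ntiles; ++k) {
        int b = k & 1;
        // Don't overwrite a buffer the compute stream is still using.
        cudaStreamWaitEvent(copy_s, consumed[b], 0);
        cudaMemcpyAsync(buf[b], host + (size_t)k * elems,
                        elems * sizeof(double), cudaMemcpyHostToDevice, copy_s);
        cudaEventRecord(copied[b], copy_s);
        // Compute waits only for its own tile, not the whole transfer queue.
        cudaStreamWaitEvent(comp_s, copied[b], 0);
        process_tile<<<(elems + 255) / 256, 256, 0, comp_s>>>(buf[b], elems);
        cudaEventRecord(consumed[b], comp_s);
    }
    cudaStreamSynchronize(comp_s);
    printf("processed %d tiles with copy/compute overlap\n", ntiles);

    for (int b = 0; b < 2; ++b) {
        cudaEventDestroy(copied[b]); cudaEventDestroy(consumed[b]); cudaFree(buf[b]);
    }
    cudaStreamDestroy(copy_s); cudaStreamDestroy(comp_s); cudaFreeHost(host);
    return 0;
}
```

The same event-based synchronization generalizes to inter-GPU tile exchange (e.g., peer-to-peer copies among the eight A100s) in place of host-to-device staging.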

📝 Abstract
Large symmetric eigenvalue problems arise in many disciplines such as chemistry and physics, and several libraries, including cuSOLVERMp, MAGMA, and ELPA, support computing large eigenvalue decompositions on multi-GPU or hybrid CPU-GPU architectures. However, these libraries do not deliver satisfactory performance: all of them utilize only around 1.5% of peak multi-GPU performance. In this paper, we propose a pipelined two-stage eigenvalue decomposition algorithm with substantial optimizations, replacing the conventional sequential algorithm. On an 8×A100 platform, our implementation surpasses the state-of-the-art cuSOLVERMp and MAGMA baselines, delivering mean speedups of 5.74× and 6.59×, with better strong and weak scalability.
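For context on what the paper replaces, the conventional workflow is a single monolithic eigensolver call per library. A minimal single-GPU example using cuSOLVER's dense symmetric eigensolver `cusolverDnDsyevd` is sketched below; cuSOLVERMp's distributed API differs, and the 4×4 matrix and the omission of error checking are for brevity only.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cusolverDn.h>

int main() {
    const int n = 4;
    // Small symmetric test matrix (column-major; symmetric, so layout is moot).
    std::vector<double> A = {
        4, 1, 1, 1,
        1, 3, 1, 1,
        1, 1, 2, 1,
        1, 1, 1, 1
    };
    double *dA, *dW, *dWork;
    int *dInfo, lwork = 0, info = 0;
    cudaMalloc(&dA, sizeof(double) * n * n);
    cudaMalloc(&dW, sizeof(double) * n);
    cudaMalloc(&dInfo, sizeof(int));
    cudaMemcpy(dA, A.data(), sizeof(double) * n * n, cudaMemcpyHostToDevice);

    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);
    // Query workspace size, then run the full eigendecomposition in one call.
    cusolverDnDsyevd_bufferSize(handle, CUSOLVER_EIG_MODE_VECTOR,
                                CUBLAS_FILL_MODE_LOWER, n, dA, n, dW, &lwork);
    cudaMalloc(&dWork, sizeof(double) * lwork);
    cusolverDnDsyevd(handle, CUSOLVER_EIG_MODE_VECTOR,
                     CUBLAS_FILL_MODE_LOWER, n, dA, n, dW, dWork, lwork, dInfo);
    cudaMemcpy(&info, dInfo, sizeof(int), cudaMemcpyDeviceToHost);

    std::vector<double> w(n);
    cudaMemcpy(w.data(), dW, sizeof(double) * n, cudaMemcpyDeviceToHost);
    printf("info = %d\n", info);
    for (int i = 0; i < n; ++i) printf("lambda[%d] = %.6f\n", i, w[i]);

    cusolverDnDestroy(handle);
    cudaFree(dA); cudaFree(dW); cudaFree(dWork); cudaFree(dInfo);
    return 0;
}
```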
Problem

Research questions and friction points this paper is trying to address.

Improving multi-GPU performance for symmetric eigenvalue decomposition
Addressing low utilization of peak performance in existing libraries
Proposing an optimized pipelined algorithm for faster computation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pipelined two-stage eigenvalue decomposition algorithm
Substantial optimizations for multi-GPU architectures, including dynamic load balancing across GPUs (see the sketch after this list)
Improved scalability and performance over existing libraries
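One of the optimizations named in the summary is dynamic load balancing across GPUs. A common way to realize it, sketched below, is a shared atomic tile counter that per-GPU worker threads drain, so faster or less-loaded GPUs naturally claim more tiles; `tile_kernel` and the tile dimensions are hypothetical stand-ins, not the paper's scheduler.

```cuda
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for a tile update; a real solver would run
// panel factorization / trailing-update kernels on the tile here.
__global__ void tile_kernel(float *tile, int elems) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < elems) tile[i] = tile[i] * 0.5f + 1.0f;
}

int main() {
    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);
    const int ntiles = 64;
    const int elems  = 1 << 20;        // elements per tile (illustrative size)
    std::atomic<int> next{0};          // shared counter = the dynamic schedule

    auto worker = [&](int dev) {
        cudaSetDevice(dev);
        float *tile;
        cudaMalloc(&tile, elems * sizeof(float));
        int done = 0;
        // Each GPU repeatedly claims the next unprocessed tile, so faster
        // or less-loaded GPUs automatically take a larger share of the work.
        for (int t = next.fetch_add(1); t < ntiles; t = next.fetch_add(1)) {
            tile_kernel<<<(elems + 255) / 256, 256>>>(tile, elems);
            cudaDeviceSynchronize();
            ++done;
        }
        cudaFree(tile);
        printf("GPU %d processed %d tiles\n", dev, done);
    };

    std::vector<std::thread> workers;
    for (int d = 0; d < ngpu; ++d) workers.emplace_back(worker, d);
    for (auto &w : workers) w.join();
    return 0;
}
```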
Authors
Hansheng Wang
Guanghua School of Management, Peking University
Statistics in Business
Ruiyi Zhan
University of Electronic Science and Technology of China
Dajun Huang
University of Electronic Science and Technology of China
Xingchen Liu
University of Chinese Academy of Sciences
Qiao Li
Xiamen University
Hancong Duan
University of Electronic Science and Technology of China
Dingwen Tao
Chinese Academy of Sciences, IEEE/ACM Senior Member
High Performance Computing, Data Reduction, Deep Learning, Systems for ML, GPU
Guangming Tan
Institute of Computing Technology, Chinese Academy of Sciences
Shaoshuai Zhang
University of Electronic Science and Technology of China