Harnessing Batched BLAS/LAPACK Kernels on GPUs for Parallel Solutions of Block Tridiagonal Systems

📅 2025-09-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses symmetric positive-definite block-tridiagonal linear systems arising in time-dependent estimation and optimal control. The authors propose a GPU-accelerated parallel solver based on recursive Schur complement reduction, which hierarchically decomposes the original problem into independent, batch-processable subproblems. To maximize GPU throughput, they design customized batched BLAS/LAPACK kernels and optimize task partitioning and memory access patterns via CUDA. The resulting open-source, cross-platform solver, TBD-GPU, demonstrates substantial speedups over state-of-the-art CPU-based sparse direct solvers (e.g., CHOLMOD, HSL MA57) across multiple benchmark problems, while remaining competitive with NVIDIA cuDSS. These results validate the effectiveness of structure-aware batched computation for solving block-structured systems on GPUs.

📝 Abstract
We present a GPU implementation for the factorization and solution of block-tridiagonal symmetric positive definite linear systems, which commonly arise in time-dependent estimation and optimal control problems. Our method employs a recursive algorithm based on Schur complement reduction, transforming the system into a hierarchy of smaller, independent blocks that can be efficiently solved in parallel using batched BLAS/LAPACK routines. While batched routines have been used in sparse solvers, our approach applies these kernels in a tailored way by exploiting the block-tridiagonal structure known in advance. Performance benchmarks based on our open-source, cross-platform implementation, TBD-GPU, demonstrate the advantages of this tailored utilization: achieving substantial speed-ups compared to state-of-the-art CPU direct solvers, including CHOLMOD and HSL MA57, while remaining competitive with NVIDIA cuDSS. However, the current implementation still performs sequential calls of batched routines at each recursion level, and the block size must be sufficiently large to adequately amortize kernel launch overhead.
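The core idea of the abstract can be illustrated with one level of Schur complement (cyclic) reduction: eliminating the odd-indexed block unknowns turns a block-tridiagonal SPD system into a half-sized system of the same form, and every odd-block elimination is independent of the others, which is what makes the step a candidate for a single batched BLAS/LAPACK call. The following NumPy sketch is an illustration of that structure, not the paper's TBD-GPU implementation; block sizes, the single reduction level, and the dense solve of the reduced system are choices made here for brevity (the actual solver recurses and uses batched GPU kernels).

```python
import numpy as np

rng = np.random.default_rng(0)
N, nb = 7, 3  # number of diagonal blocks, block size (illustrative values)

# Block-tridiagonal SPD system: row i reads
#   L[i-1] x[i-1] + D[i] x[i] + L[i].T x[i+1] = b[i]
# with L[i] the subdiagonal block coupling rows i+1 and i.
L = [rng.standard_normal((nb, nb)) for _ in range(N - 1)]
D = []
for i in range(N):
    M = rng.standard_normal((nb, nb))
    D.append(M @ M.T + 10.0 * nb * np.eye(nb))  # shifted to ensure SPD
b = [rng.standard_normal(nb) for _ in range(N)]

# --- One level of Schur-complement reduction -------------------------
# Each odd-block factorization/inversion is independent; a GPU solver
# would perform all of them in one batched POTRF-style call.
Dinv = {i: np.linalg.inv(D[i]) for i in range(1, N, 2)}

even = list(range(0, N, 2))
Dr = [D[j].copy() for j in even]  # reduced diagonal blocks
br = [b[j].copy() for j in even]
Lr = []                           # reduced subdiagonal blocks
for k, j in enumerate(even):
    if j - 1 >= 0:   # eliminate odd neighbor j-1
        Dr[k] -= L[j - 1] @ Dinv[j - 1] @ L[j - 1].T
        br[k] -= L[j - 1] @ Dinv[j - 1] @ b[j - 1]
    if j + 1 < N:    # eliminate odd neighbor j+1
        Dr[k] -= L[j].T @ Dinv[j + 1] @ L[j]
        br[k] -= L[j].T @ Dinv[j + 1] @ b[j + 1]
    if j + 2 < N:    # new coupling between even blocks j and j+2
        Lr.append(-L[j + 1] @ Dinv[j + 1] @ L[j])

# Solve the assembled reduced system densely for illustration;
# the real algorithm would instead recurse on this smaller system.
m = len(even)
Ar = np.zeros((m * nb, m * nb))
for k in range(m):
    Ar[k*nb:(k+1)*nb, k*nb:(k+1)*nb] = Dr[k]
for k in range(m - 1):
    Ar[(k+1)*nb:(k+2)*nb, k*nb:(k+1)*nb] = Lr[k]
    Ar[k*nb:(k+1)*nb, (k+1)*nb:(k+2)*nb] = Lr[k].T
xe = np.linalg.solve(Ar, np.concatenate(br)).reshape(m, nb)

# Back-substitute the odd unknowns (again fully independent/batchable).
x = [None] * N
for k, j in enumerate(even):
    x[j] = xe[k]
for i in range(1, N, 2):
    r = b[i] - L[i - 1] @ x[i - 1]
    if i + 1 < N:
        r -= L[i].T @ x[i + 1]
    x[i] = Dinv[i] @ r

# Verify against a dense solve of the full system.
A = np.zeros((N * nb, N * nb))
for i in range(N):
    A[i*nb:(i+1)*nb, i*nb:(i+1)*nb] = D[i]
for i in range(N - 1):
    A[(i+1)*nb:(i+2)*nb, i*nb:(i+1)*nb] = L[i]
    A[i*nb:(i+1)*nb, (i+1)*nb:(i+2)*nb] = L[i].T
x_ref = np.linalg.solve(A, np.concatenate(b)).reshape(N, nb)
print(np.allclose(np.stack(x), x_ref))
```

Note that the reduced system is again block-tridiagonal and SPD (a Schur complement of an SPD matrix is SPD), which is what allows the algorithm to recurse with the same batched kernels at every level.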
Problem

Research questions and friction points this paper is trying to address.

Solving block-tridiagonal symmetric positive definite linear systems
Implementing parallel GPU factorization using batched BLAS/LAPACK kernels
Optimizing performance for time-dependent estimation and control problems
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPU-accelerated block-tridiagonal system solver
Recursive Schur complement reduction algorithm
Parallel processing via batched BLAS/LAPACK kernels
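The batching idea behind the third contribution can be sketched on the CPU: NumPy's `linalg` routines broadcast over leading array axes, which plays the role that batched POTRF/POTRS (or GETRF) calls play in cuBLAS/cuSOLVER on the GPU, i.e. one call factors or solves a whole stack of independent blocks instead of looping over them. This is an analogy only; the batch and block sizes below are arbitrary, and the paper's solver uses customized CUDA kernels rather than NumPy.

```python
import numpy as np

rng = np.random.default_rng(1)
batch, nb = 64, 8  # illustrative batch count and block size

# A stack of independent SPD blocks, shape (batch, nb, nb), as produced
# at one recursion level of a Schur-complement reduction.
M = rng.standard_normal((batch, nb, nb))
A = M @ M.transpose(0, 2, 1) + nb * np.eye(nb)
B = rng.standard_normal((batch, nb))

# Batched factorization: one call handles all blocks at once
# (the CPU analogue of a single batched POTRF kernel launch).
Lc = np.linalg.cholesky(A)

# Batched solve over the whole stack in one call.
y = np.linalg.solve(A, B[..., None])[..., 0]

# Same result as a per-block Python loop, but without per-block call
# overhead -- the motivation for batched kernels on GPUs.
y_loop = np.stack([np.linalg.solve(A[i], B[i]) for i in range(batch)])
print(np.allclose(y, y_loop))
```

The design point this illustrates is the one the abstract hedges on: batching amortizes fixed per-call (on GPUs, kernel-launch) overhead, so the benefit grows with block size and batch count.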
David Jin
Massachusetts Institute of Technology
Optimization · Machine Learning · Robotics
Alexis Montoison
Argonne National Laboratory
Sungho Shin
Massachusetts Institute of Technology