AI Summary
This study addresses the limited scalability of quantum transport simulations for nanoscale multi-terminal transistors, which is hindered by the sequential nature of existing algorithms, their reliance on block-tridiagonal matrix inputs, and their restriction to shared-memory parallelism. To overcome these limitations, the work proposes distributed GPU-parallel methods that build on the recursive Green's function (RGF) technique and parallelize both selected inversion and the selected solution of quadratic matrix equations. The methods also support block-tridiagonal matrices with an arrowhead, enabling the simulation of multi-terminal devices. Evaluated on data from a non-equilibrium Green's function (NEGF) simulation of a nano-ribbon transistor, the fused solver running on 16 GPUs is 5.2× faster than the selected inversion module of PARDISO applied to a device 16 times shorter.
Abstract
Driven by Moore's Law, the dimensions of transistors have been pushed down to the nanometer scale. Advanced quantum transport (QT) solvers are required to accurately simulate such nano-devices. The non-equilibrium Green's function (NEGF) formalism lends itself optimally to these tasks, but it is computationally very intensive, involving the selected inversion (SI) of matrices and the selected solution of quadratic matrix (SQ) equations. Existing algorithms to tackle these numerical problems are ideally suited to GPU acceleration, e.g., the so-called recursive Green's function (RGF) technique, but they are typically sequential, require block-tridiagonal (BT) matrices as inputs, and their implementation has been so far restricted to shared memory parallelism, thus limiting the achievable device sizes. To address these shortcomings, we introduce distributed methods that build on RGF and enable parallel selected inversion and selected solution of the quadratic matrix equation. We further extend them to handle BT matrices with an arrowhead, which allows for the investigation of multi-terminal transistor structures. We evaluate the performance of our approach on a real dataset from the QT simulation of a nano-ribbon transistor and compare it with the sparse direct package PARDISO. When scaling to 16 GPUs, our fused SI and SQ solver is 5.2x faster than the SI module of PARDISO applied to a device 16x shorter. These results highlight the potential of our method to accelerate NEGF-based nano-device simulations.
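To make the sequential baseline mentioned in the abstract concrete, the following is a minimal NumPy sketch of the textbook RGF selected inversion for a block-tridiagonal matrix. It returns only the diagonal blocks of the inverse. It is not the distributed GPU method of this work, and it covers neither the arrowhead extension nor the quadratic matrix equation. The function names (`rgf_selected_inversion`, `assemble_dense`) and the list-of-blocks layout are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def rgf_selected_inversion(diag, upper, lower):
    """Diagonal blocks of inv(A) for a block-tridiagonal matrix A.

    Assumed (hypothetical) layout:
      diag[i]  = A[i, i]      for i = 0 .. n-1
      upper[i] = A[i, i+1]    for i = 0 .. n-2
      lower[i] = A[i+1, i]    for i = 0 .. n-2
    """
    n = len(diag)

    # Forward pass: inverses of the successive Schur complements.
    # Each step depends on the previous one, hence the sequential nature.
    g = [None] * n
    g[0] = np.linalg.inv(diag[0])
    for i in range(1, n):
        g[i] = np.linalg.inv(diag[i] - lower[i - 1] @ g[i - 1] @ upper[i - 1])

    # Backward pass: recover the diagonal blocks of the full inverse.
    x = [None] * n
    x[n - 1] = g[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = g[i] + g[i] @ upper[i] @ x[i + 1] @ lower[i] @ g[i]
    return x


def assemble_dense(diag, upper, lower):
    """Build the dense matrix from its blocks (only for checking small cases)."""
    n, b = len(diag), diag[0].shape[0]
    a = np.zeros((n * b, n * b), dtype=diag[0].dtype)
    for i in range(n):
        a[i * b:(i + 1) * b, i * b:(i + 1) * b] = diag[i]
    for i in range(n - 1):
        a[i * b:(i + 1) * b, (i + 1) * b:(i + 2) * b] = upper[i]
        a[(i + 1) * b:(i + 2) * b, i * b:(i + 1) * b] = lower[i]
    return a
```

On a small, well-conditioned test matrix the returned blocks can be compared against the corresponding blocks of `np.linalg.inv(assemble_dense(...))`. The forward loop shows why the plain recursion is sequential: every Schur complement needs the result of the previous block.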