Efficient Hardware Accelerator Based on Medium Granularity Dataflow for SpTRSV

📅 2024-06-15
🏛️ IEEE Transactions on Very Large Scale Integration (VLSI) Systems
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the inherent trade-off in sparse triangular solve (SpTRSV) between low coarse-grained dataflow parallelism and poor spatial locality in fine-grained dataflow, this work presents the first medium-grained dataflow hardware accelerator. Our approach features: (1) a medium-grained DAG scheduling architecture that jointly optimizes parallelism and spatial locality; (2) a partial-sum caching mechanism to alleviate PE stalls; and (3) an intra-node edge computation reordering algorithm to enhance data reuse. Leveraging software-hardware co-design, the accelerator achieves, across 245 benchmarks, average speedups of 7.0Γ— over CPUs and 5.8Γ— over GPUs (up to 27.8Γ— and 98.8Γ—, respectively), and outperforms the state-of-the-art DPU-v2 by 2.5Γ— in performance and 1.7Γ— in energy efficiency.
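The kernel the accelerator targets can be stated concretely: solve Lx = b for a sparse lower-triangular matrix L by forward substitution. A minimal reference sketch in plain Python, assuming L is stored in CSR with each row sorted by column index and the diagonal as the row's last stored entry (names are illustrative, not from the paper):

```python
def sptrsv_csr(n, row_ptr, col_idx, vals, b):
    """Solve L x = b by forward substitution, L lower-triangular in CSR.

    Assumes each row's entries are sorted by column and the diagonal is
    the last stored entry of its row.
    """
    x = [0.0] * n
    for i in range(n):
        s = b[i]
        # Subtract contributions from already-solved unknowns x[j], j < i.
        for k in range(row_ptr[i], row_ptr[i + 1] - 1):
            s -= vals[k] * x[col_idx[k]]
        diag = vals[row_ptr[i + 1] - 1]
        x[i] = s / diag
    return x
```

The inner loop's dependence on earlier entries of x is exactly what makes SpTRSV hard to parallelize: row i cannot start until every row j it references has finished, which induces the DAG that the coarse-, fine-, and medium-grained dataflows schedule in different ways.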

πŸ“ Abstract
Sparse triangular solve (SpTRSV) is widely used in various domains. Numerous studies have been conducted using CPUs, GPUs, and specific hardware accelerators, where dataflows can be categorized into coarse and fine granularity. Coarse dataflows offer good spatial locality but suffer from low parallelism, while fine dataflows provide high parallelism but disrupt the spatial structure, leading to increased nodes and poor data reuse. This article proposes a novel hardware accelerator for SpTRSV or SpTRSV-like directed acyclic graphs (DAGs). The accelerator implements a medium granularity dataflow through hardware-software codesign and achieves both excellent spatial locality and high parallelism. In addition, a partial sum caching mechanism is introduced to reduce the blocking frequency of processing elements (PEs), and a reordering algorithm of intranode edges’ computation is developed to enhance data reuse. Experimental results on 245 benchmarks with node counts reaching up to 85392 demonstrate that this work achieves average performance improvements of <inline-formula> <tex-math notation="LaTeX">$7.0 imes $ </tex-math></inline-formula> (up to <inline-formula> <tex-math notation="LaTeX">$27.8 imes $ </tex-math></inline-formula>) over CPUs and <inline-formula> <tex-math notation="LaTeX">$5.8 imes $ </tex-math></inline-formula> (up to <inline-formula> <tex-math notation="LaTeX">$98.8 imes $ </tex-math></inline-formula>) over GPUs. Compared with the state-of-the-art technique (DPU-v2), this work shows a <inline-formula> <tex-math notation="LaTeX">$2.5 imes $ </tex-math></inline-formula> (up to <inline-formula> <tex-math notation="LaTeX">$5.9 imes $ </tex-math></inline-formula>) average performance improvement and <inline-formula> <tex-math notation="LaTeX">$1.7 imes $ </tex-math></inline-formula> (up to <inline-formula> <tex-math notation="LaTeX">$4.1 imes $ </tex-math></inline-formula>) average energy efficiency enhancement.
Problem

Research questions and friction points this paper is trying to address.

Designing an efficient hardware accelerator for sparse triangular solve
Balancing spatial locality and parallelism in dataflow design
Enhancing performance and energy efficiency for SpTRSV computations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Medium granularity dataflow for spatial locality and parallelism
Partial sum caching mechanism to reduce PE blocking
Reordering algorithm of intra-node edges for data reuse
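The coarse-grained baseline that these contributions are measured against is typically level-set scheduling: rows of L are grouped into levels by dependency depth, and all rows in a level can be solved in parallel. The sketch below illustrates that standard technique (not the paper's medium-grained partitioner; function and variable names are illustrative), assuming the same CSR inputs as a forward-substitution solver:

```python
def level_sets(n, row_ptr, col_idx):
    """Group rows of a lower-triangular CSR matrix into level sets.

    level[i] is the longest dependency chain ending at row i; all rows
    sharing a level have no dependencies on one another and may be
    solved in parallel.
    """
    level = [0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j != i:  # off-diagonal entry => row i depends on row j
                level[i] = max(level[i], level[j] + 1)
    # Bucket rows by level, in increasing dependency depth.
    buckets = {}
    for i, lv in enumerate(level):
        buckets.setdefault(lv, []).append(i)
    return [buckets[lv] for lv in sorted(buckets)]
```

Many shallow levels with few rows each is the regime where this coarse scheme loses parallelism, while splitting rows into individual scalar operations (the fine-grained extreme) inflates the DAG and destroys locality; the paper's medium granularity sits between these two points.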
Qian Chen
Department of National ASIC System Engineering Research Center, Southeast University, Nanjing 210096, China
Xiaofeng Yang
Department of National ASIC System Engineering Research Center, Southeast University, Nanjing 210096, China
Shengli Lu
Department of National ASIC System Engineering Research Center, Southeast University, Nanjing 210096, China