AI Summary
To address the inherent trade-off in sparse triangular solve (SpTRSV) between the low parallelism of coarse-grained dataflow and the poor spatial locality of fine-grained dataflow, this work presents the first medium-grained dataflow hardware accelerator. Our approach features: (1) a medium-grained DAG scheduling architecture that jointly optimizes parallelism and spatial locality; (2) a partial-sum caching mechanism to alleviate PE stalls; and (3) an intra-node edge computation reordering algorithm to enhance data reuse. Leveraging software-hardware co-design, the accelerator achieves, across 245 benchmarks, average speedups of 7.0× over CPUs and 5.8× over GPUs (up to 27.8× and 98.8×, respectively), and outperforms the state-of-the-art DPU-v2 by 2.5× in performance and 1.7× in energy efficiency.
Abstract
Sparse triangular solve (SpTRSV) is widely used across many domains. Numerous implementations exist on CPUs, GPUs, and dedicated hardware accelerators, whose dataflows can be categorized as coarse- or fine-grained. Coarse-grained dataflows offer good spatial locality but suffer from low parallelism, while fine-grained dataflows provide high parallelism but disrupt the spatial structure, increasing the node count and degrading data reuse. This article proposes a novel hardware accelerator for SpTRSV and SpTRSV-like directed acyclic graphs (DAGs). Through hardware-software co-design, the accelerator implements a medium-grained dataflow that achieves both excellent spatial locality and high parallelism. In addition, a partial-sum caching mechanism is introduced to reduce the blocking frequency of processing elements (PEs), and a reordering algorithm for intra-node edge computation is developed to enhance data reuse. Experimental results on 245 benchmarks with node counts of up to 85,392 demonstrate that this work achieves average performance improvements of 7.0× (up to 27.8×) over CPUs and 5.8× (up to 98.8×) over GPUs. Compared with the state-of-the-art technique (DPU-v2), this work shows a 2.5× (up to 5.9×) average performance improvement and a 1.7× (up to 4.1×) average energy-efficiency enhancement.
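To make the dependency structure the abstract refers to concrete, the following is a minimal sketch (not the paper's algorithm) of forward-substitution SpTRSV on a CSR lower-triangular matrix, with a classic level-set pass that groups rows into dependency levels: rows in the same level have no mutual dependencies and could run in parallel, while the off-diagonal nonzeros form the edges of the DAG. The function name and CSR layout here are illustrative assumptions.

```python
import numpy as np

def sptrsv_levels(indptr, indices, data, b):
    """Solve L x = b for a sparse lower-triangular L in CSR form.

    Also computes a level-set schedule: level[i] is the earliest
    parallel step at which row i can execute, given that it depends
    on every earlier row j referenced by an off-diagonal nonzero
    (i.e., on every edge j -> i of the task DAG).
    This is an illustrative sketch, not the accelerator's dataflow.
    """
    n = len(b)
    level = [0] * n
    for i in range(n):
        # Row i must wait for every row j < i it reads from.
        deps = [level[j] for j in indices[indptr[i]:indptr[i + 1]] if j < i]
        level[i] = 1 + max(deps, default=-1)

    # Serial reference solve (forward substitution).
    x = np.zeros(n)
    for i in range(n):
        s = float(b[i])
        diag = 1.0
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j < i:
                s -= data[k] * x[j]   # consume an incoming DAG edge
            elif j == i:
                diag = data[k]        # diagonal entry of L
        x[i] = s / diag
    return x, level
```

In this framing, a coarse-grained dataflow would assign whole levels (or large subtrees) to one processing element, preserving locality at the cost of parallelism, whereas a fine-grained dataflow would distribute individual edge updates (`s -= data[k] * x[j]`) across PEs.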