AI Summary
To address the inherent trade-off in sparse triangular solve (SpTRSV) between the low parallelism of coarse-grained dataflow and the poor spatial locality of fine-grained dataflow, this work presents the first medium-grained dataflow hardware accelerator. Our approach features: (1) a medium-grained DAG scheduling architecture that jointly optimizes parallelism and spatial locality; (2) a partial-sum caching mechanism to alleviate PE stalls; and (3) an intra-node edge computation reordering algorithm to enhance data reuse. Leveraging software-hardware co-design, the accelerator achieves, across 245 benchmarks, average speedups of 7.0× over CPUs and 5.8× over GPUs (up to 27.8× and 98.8×, respectively), and outperforms the state-of-the-art DPU-v2 by 2.5× in performance and 1.7× in energy efficiency.
Abstract
Sparse triangular solve (SpTRSV) is widely used across many domains. Numerous implementations exist on CPUs, GPUs, and dedicated hardware accelerators, whose dataflows can be categorized as coarse- or fine-grained. Coarse-grained dataflows offer good spatial locality but suffer from low parallelism, while fine-grained dataflows provide high parallelism but disrupt the spatial structure, increasing the node count and degrading data reuse. This article proposes a novel hardware accelerator for SpTRSV and SpTRSV-like directed acyclic graphs (DAGs). Through hardware-software co-design, the accelerator implements a medium-grained dataflow that achieves both excellent spatial locality and high parallelism. In addition, a partial-sum caching mechanism is introduced to reduce the blocking frequency of processing elements (PEs), and a reordering algorithm for intra-node edge computation is developed to enhance data reuse. Experimental results on 245 benchmarks with node counts of up to 85,392 demonstrate that this work achieves average performance improvements of 7.0× (up to 27.8×) over CPUs and 5.8× (up to 98.8×) over GPUs. Compared with the state-of-the-art technique (DPU-v2), this work shows a 2.5× (up to 5.9×) average performance improvement and a 1.7× (up to 4.1×) average energy-efficiency enhancement.
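To make the dependency structure the abstract refers to concrete, the following is a minimal sketch (not the paper's algorithm) of forward-substitution SpTRSV on a CSR lower-triangular matrix, with a classic level-set pass that groups rows into dependency levels: rows in the same level have no mutual dependencies and could run in parallel, while the off-diagonal nonzeros form the edges of the DAG. The function name and CSR layout here are illustrative assumptions.

```python
import numpy as np

def sptrsv_levels(indptr, indices, data, b):
    """Solve L x = b for a sparse lower-triangular L in CSR form.

    Also computes a level-set schedule: level[i] is the earliest
    parallel step at which row i can execute, given that it depends
    on every earlier row j referenced by an off-diagonal nonzero
    (i.e., on every edge j -> i of the task DAG).
    This is an illustrative sketch, not the accelerator's dataflow.
    """
    n = len(b)
    level = [0] * n
    for i in range(n):
        # Row i must wait for every row j < i it reads from.
        deps = [level[j] for j in indices[indptr[i]:indptr[i + 1]] if j < i]
        level[i] = 1 + max(deps, default=-1)

    # Serial reference solve (forward substitution).
    x = np.zeros(n)
    for i in range(n):
        s = float(b[i])
        diag = 1.0
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j < i:
                s -= data[k] * x[j]   # consume an incoming DAG edge
            elif j == i:
                diag = data[k]        # diagonal entry of L
        x[i] = s / diag
    return x, level
```

In this framing, a coarse-grained dataflow would assign whole levels (or large subtrees) to one processing element, preserving locality at the cost of parallelism, whereas a fine-grained dataflow would distribute individual edge updates (`s -= data[k] * x[j]`) across PEs.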