Systolic Sparse Tensor Slices: FPGA Building Blocks for Sparse and Dense AI Acceleration

πŸ“… 2025-02-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Traditional FPGA architectures are optimized for dense computation and struggle to efficiently exploit the prevalent structured sparsity in deep neural networks (DNNs). To address this, we propose a two-dimensional systolic sparse tensor (SST) sliceβ€”a hardware unit integrated into programmable logic. It is the first FPGA-native architecture to uniformly support mixed dense and structured-sparse operations under multiple sparsity patterns, including 1:3, 2:4, and 1:4. Leveraging a reconfigurable systolic design, it balances flexibility and hardware efficiency. Key techniques include structured sparse encoding, routing optimization, and hybrid dataflow scheduling. The resulting GEMM accelerator achieves a 5Γ— higher operating frequency and 10.9% smaller area versus baseline. Evaluated on sparse Vision Transformers (ViTs) and CNNs, it delivers 3.52Γ— speedup with only a 13.3% area overhead.

Technology Category

Application Category

πŸ“ Abstract
FPGA architectures have recently been enhanced to meet the substantial computational demands of modern deep neural networks (DNNs). To this end, both FPGA vendors and academic researchers have proposed in-fabric blocks that perform efficient tensor computations. However, these blocks are primarily optimized for dense computation, while most DNNs exhibit sparsity. To address this limitation, we propose incorporating structured sparsity support into FPGA architectures. We architect 2D systolic in-fabric blocks, named systolic sparse tensor (SST) slices, that support multiple degrees of sparsity to efficiently accelerate a wide variety of DNNs. SSTs support dense operation, 2:4 (50%) and 1:4 (75%) sparsity, as well as a new 1:3 (66.7%) sparsity level to further increase flexibility. When demonstrating on general matrix multiplication (GEMM) accelerators, which are the heart of most current DNN accelerators, our sparse SST-based designs attain up to 5x higher FPGA frequency and 10.9x lower area, compared to traditional FPGAs. Moreover, evaluation of the proposed SSTs on state-of-the-art sparse ViT and CNN models exhibits up to 3.52x speedup with minimal area increase of up to 13.3%, compared to dense in-fabric acceleration.
Problem

Research questions and friction points this paper is trying to address.

Enhancing FPGA for sparse DNN computation.
Designing SST slices for multiple sparsity levels.
Achieving higher speed and lower area in FPGA.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systolic sparse tensor slices
Support multiple sparsity degrees
Enhance FPGA frequency and area efficiency
πŸ”Ž Similar Papers
No similar papers found.