Spatz: Clustering Compact RISC-V-Based Vector Units to Maximize Computing Efficiency

📅 2023-09-18
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
To address hardware bottlenecks in energy-efficient floating-point computation, this work proposes Spatz—a compact 64-bit floating-point vector processor based on the RISC-V Zve64d extension—along with a scalable dual-core cluster architecture. Notably, it employs a minimal 2 KiB latch-based vector register file and a shared scratchpad memory, substantially reducing area and power overhead in GlobalFoundries' 12LPP process. The proposed cluster delivers a peak performance of 15.7 DP-GFLOPS at 1 GHz and 0.80 V on a double-precision matrix multiplication, and reaches 95.0% FPU utilization on a realistic 2D convolution workload. It attains an energy efficiency of 95.7 DP-GFLOPS/W on matrix multiplication, improving to 99.3 DP-GFLOPS/W and 171 DP-GFLOPS/W/mm² on convolution—representing a 30% energy-efficiency gain over a scalar-core cluster of equivalent die area and FPU count.
📝 Abstract
The ever-increasing computational and storage requirements of modern applications and the slowdown of technology scaling pose major challenges to designing and implementing efficient computer architectures. To mitigate the bottlenecks of typical processor-based architectures on both the instruction and data sides of the memory, we present Spatz, a compact 64-bit floating-point-capable vector processor based on RISC-V's Vector Extension Zve64d. Using Spatz as the main Processing Element (PE), we design an open-source dual-core vector processor architecture based on a modular and scalable cluster sharing a Scratchpad Memory (SCM). Unlike typical vector processors, whose Vector Register Files (VRFs) are hundreds of KiB large, we prove that Spatz can achieve peak energy efficiency with a latch-based VRF of only 2 KiB. An implementation of the Spatz-based cluster in GlobalFoundries' 12LPP process with eight double-precision Floating Point Units (FPUs) achieves an FPU utilization just 3.4% lower than the ideal upper bound on a double-precision, floating-point matrix multiplication. The cluster reaches 7.7 FMA/cycle, corresponding to 15.7 DP-GFLOPS and 95.7 DP-GFLOPS/W at 1 GHz and nominal operating conditions (TT, 0.80 V, 25 °C), with more than 55% of the power spent on the FPUs. Furthermore, the optimally balanced Spatz-based cluster reaches a 95.0% FPU utilization (7.6 FMA/cycle), 15.2 DP-GFLOPS, and 99.3 DP-GFLOPS/W (61% of the power spent in the FPU) on a 2D convolution workload with a 7×7 kernel, resulting in an outstanding area/energy efficiency of 171 DP-GFLOPS/W/mm². At equal area, the computing cluster built upon compact vector processors reaches a 30% higher energy efficiency than a cluster with the same FPU count built upon scalar cores specialized for stream-based floating-point computation.
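As a sanity check (not taken from the paper's artifacts), the abstract's headline throughput figures can be re-derived from the stated configuration: eight double-precision FPUs, one FMA (two FLOPs) per FPU per cycle, at 1 GHz. A minimal sketch:

```python
# Back-of-the-envelope check of the Spatz cluster's headline numbers,
# assuming 8 DP FPUs, 1 FMA/FPU/cycle (= 2 FLOP), 1 GHz clock.
N_FPUS = 8
FLOP_PER_FMA = 2
FREQ_GHZ = 1.0

# Ideal peak: 8 FMA/cycle -> 16 DP-GFLOPS at 1 GHz.
peak_gflops = N_FPUS * FLOP_PER_FMA * FREQ_GHZ

# 7x7 convolution workload: 95.0% FPU utilization (from the abstract).
util = 0.95
sustained_fma_per_cycle = util * N_FPUS                    # ~7.6 FMA/cycle
sustained_gflops = sustained_fma_per_cycle * FLOP_PER_FMA  # ~15.2 DP-GFLOPS

print(peak_gflops, sustained_fma_per_cycle, sustained_gflops)
```

The computed values match the abstract's 7.6 FMA/cycle and 15.2 DP-GFLOPS; the matrix-multiplication figure of 15.7 DP-GFLOPS likewise corresponds to roughly 98% of the 16 DP-GFLOPS ideal peak.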
Problem

Research questions and friction points this paper is trying to address.

High-performance computing
Energy efficiency
Scalable processor design
Innovation

Methods, ideas, or system contributions that make the work stand out.

RISC-V
Energy Efficiency
Modular Architecture
Matteo Perotti
ETH Zürich
Electronic engineering, computer architecture, RISC-V
Samuel Riedel
Integrated Systems Laboratory (IIS), ETH Zurich, 8092 Zurich, Switzerland
Matheus A. Cavalcante
Integrated Systems Laboratory (IIS), ETH Zurich, 8092 Zurich, Switzerland
Luca Benini
ETH Zürich, Università di Bologna
Integrated Circuits, Computer Architecture, Embedded Systems, VLSI, Machine Learning