Occamy: A 432-Core Dual-Chiplet Dual-HBM2E 768-DP-GFLOP/s RISC-V System for 8-to-64-bit Dense and Sparse Computing in 12nm FinFET

📅 2025-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low compute efficiency of mixed dense and sparse workloads in AI and HPC, this paper presents the first RISC-V many-core heterogeneous system supporting unified dense and sparse computation across the full FP8-to-FP64 precision range. The architecture couples two 12 nm RISC-V compute chiplets on a 65 nm passive interposer (Hedwig) with dual HBM2E memory stacks, custom in-core streaming units (SUs), a latency-tolerant hierarchical interconnect, and kernel-level streaming execution. Experimental results show 89% FPU utilization on double-precision dense linear algebra (LA), 5.2× higher FPU utilization on sparse-dense LA than state-of-the-art baselines, up to 187 GCOMP/s on sparse-sparse LA, and 75% and 54% FPU utilization on LLM and GCN inference, respectively, substantially improving hardware utilization and energy efficiency for sparse computation.
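As a quick check on what these headline figures imply (assuming the 89% utilization applies to the full 768 DP-GFLOP/s double-precision peak stated in the title), the sustained dense-LA throughput works out to roughly 0.89 × 768 ≈ 683 DP-GFLOP/s.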

📝 Abstract
ML and HPC applications increasingly combine dense and sparse memory access computations to maximize storage efficiency. However, existing CPUs and GPUs struggle to flexibly handle these heterogeneous workloads with consistently high compute efficiency. We present Occamy, a 432-core, 768-DP-GFLOP/s, dual-HBM2E, dual-chiplet RISC-V system with a latency-tolerant hierarchical interconnect and in-core streaming units (SUs) designed to accelerate dense and sparse FP8-to-FP64 ML and HPC workloads. We implement Occamy's compute chiplets in 12 nm FinFET, and its passive interposer, Hedwig, in a 65 nm node. On dense linear algebra (LA), Occamy achieves a competitive FPU utilization of 89%. On stencil codes, Occamy reaches an FPU utilization of 83% and a technology-node-normalized compute density of 11.1 DP-GFLOP/s/mm², leading state-of-the-art (SoA) processors by 1.7× and 1.2×, respectively. On sparse-dense linear algebra (LA), it achieves 42% FPU utilization and a normalized compute density of 5.95 DP-GFLOP/s/mm², surpassing the SoA by 5.2× and 11×, respectively. On sparse-sparse LA, Occamy reaches a throughput of up to 187 GCOMP/s at 17.4 GCOMP/s/W and a compute density of 3.63 GCOMP/s/mm². Finally, we reach up to 75% and 54% FPU utilization on dense (LLM) and graph-sparse (GCN) ML inference workloads, respectively. Occamy's RTL is freely available under a permissive open-source license.
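The sparse-dense results above rest on the in-core streaming units. As a rough illustration only (not the paper's code; all function and parameter names below are hypothetical), the sketch shows a plain CSR sparse-matrix-times-dense-vector kernel and marks, in comments, the memory operations an SU-style streamer would absorb so that the inner loop reduces to multiply-accumulates.

```c
/* Conceptual sketch, not the Occamy API: a CSR sparse-matrix x dense-vector
 * product, the kind of sparse-dense LA kernel the abstract reports 42% FPU
 * utilization on. On a conventional core, most instructions in the inner loop
 * go to address generation and loads; an in-core streaming unit (SU) can
 * instead feed the value, index, and gathered-x streams directly to the FPU. */
#include <stddef.h>

void spmv_csr(size_t n_rows,
              const size_t *row_ptr,   /* CSR row offsets, length n_rows + 1 */
              const size_t *col_idx,   /* column index of each nonzero       */
              const double *vals,      /* nonzero values                     */
              const double *x,         /* dense input vector                 */
              double *y)               /* dense output vector                */
{
    for (size_t r = 0; r < n_rows; ++r) {
        double acc = 0.0;
        /* With an SU, the three explicit loads below (vals[i], col_idx[i],
         * and the indirect gather x[col_idx[i]]) would be replaced by
         * hardware-managed streams, leaving an FMA-only inner loop. */
        for (size_t i = row_ptr[r]; i < row_ptr[r + 1]; ++i)
            acc += vals[i] * x[col_idx[i]];
        y[r] = acc;
    }
}
```

The comments only locate where address generation and indirect gathers dominate on a conventional core; the actual SU programming model and kernel-level streaming execution are described in the paper itself.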
Problem

Research questions and friction points this paper is trying to address.

Processor Efficiency
Dense and Sparse Data
Machine Learning and High-Performance Computing
Innovation

Methods, ideas, or system contributions that make the work stand out.

High-Performance Computing
Dual-Chiplet and Dual-HBM2E Technology
12nm FinFET Process