FlooNoC: A 645-Gb/s/link 0.15-pJ/B/hop Open-Source NoC With Wide Physical Links and End-to-End AXI4 Parallel Multistream Support

📅 2024-09-26
🏛️ IEEE Transactions on Very Large Scale Integration (VLSI) Systems
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the urgent demand for high-throughput, low-power on-chip interconnects in AI accelerators, this paper proposes FlooNoC—a novel open-source Network-on-Chip (NoC) architecture optimized for high-bandwidth data transfer. Departing from conventional cache-coherent systems that prioritize fine-grained, low-latency communication, FlooNoC introduces an end-to-end AXI4-based multi-stream ordered transmission mechanism. It integrates a multi-stream DMA engine, a decoupled low-latency short-message link, a non-blocking switch fabric, and a deeply customized ultra-wide AXI4 physical interface (645 Gb/s per link). Implemented in 12 nm FinFET technology, FlooNoC achieves a total aggregate bandwidth of 103 Tb/s, an energy efficiency of 0.15 pJ/B/hop at 0.8 V, and an area overhead of only 3.5% relative to the compute tile area. Compared to state-of-the-art NoCs, it delivers over 2× higher link bandwidth and more than 3× better energy efficiency.

📝 Abstract
The new generation of domain-specific AI accelerators is characterized by rapidly increasing demands for bulk data transfers, as opposed to small, latency-critical cache line transfers typical of traditional cache-coherent systems. In this article, we address this critical need by introducing the FlooNoC network-on-chip (NoC), featuring very wide, fully Advanced eXtensible Interface (AXI4)-compliant links designed to meet the massive bandwidth needs at high energy efficiency. At the transport level, nonblocking transactions are supported for latency tolerance. In addition, a novel end-to-end ordering approach for AXI4, enabled by a multistream capable direct memory access (DMA) engine, simplifies network interfaces (NIs) and eliminates interstream dependencies. Furthermore, dedicated physical links are instantiated for short, latency-critical messages. A complete end-to-end reference implementation in 12-nm FinFET technology demonstrates the physical feasibility and power performance area (PPA) benefits of our approach. Using wide links on high levels of metal, we achieve a bandwidth of 645 Gb/s/link and a total aggregate bandwidth of 103 Tb/s for an $8 \times 4$ mesh of processor cluster tiles, with a total of 288 RISC-V cores. The NoC imposes a minimal area overhead of only 3.5% per compute tile and achieves a leading-edge energy efficiency of 0.15 pJ/B/hop at 0.8 V. Compared with state-of-the-art (SoA) NoCs, our system offers three times the energy efficiency and more than double the link bandwidth. Furthermore, compared with a traditional AXI4-based multilayer interconnect, our NoC achieves a 30% reduction in area, corresponding to a 47% increase in double-precision GFLOPS within the same floorplan.
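The headline figures above can be sanity-checked with simple arithmetic. The sketch below derives the per-link power at full load and the number of saturated links implied by the aggregate bandwidth; these derived numbers are computed here, not quoted from the paper.

```python
# Back-of-envelope check of FlooNoC's headline figures (from the abstract).
LINK_BW_BPS = 645e9             # 645 Gb/s per link
ENERGY_J_PER_BYTE_HOP = 0.15e-12  # 0.15 pJ/B/hop at 0.8 V
AGGREGATE_BW_BPS = 103e12       # 103 Tb/s total

link_bw_bytes = LINK_BW_BPS / 8                     # ~80.6 GB/s per link
power_w_per_hop = link_bw_bytes * ENERGY_J_PER_BYTE_HOP
print(f"Per-link power at full load: {power_w_per_hop * 1e3:.2f} mW/hop")

# How many fully loaded links the aggregate figure corresponds to.
saturated_links = AGGREGATE_BW_BPS / LINK_BW_BPS
print(f"Aggregate bandwidth ~ {saturated_links:.0f} saturated links")
```

At full load a link thus dissipates on the order of 12 mW per hop, and the 103 Tb/s aggregate corresponds to roughly 160 concurrently saturated links across the $8 \times 4$ mesh.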
Problem

Research questions and friction points this paper is trying to address.

Addressing high bandwidth demands for AI accelerators efficiently
Enhancing energy efficiency in Network-on-Chip (NoC) designs
Simplifying network interfaces with end-to-end AXI4 ordering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Wide AXI4-compliant links for high bandwidth
Multi-stream DMA engine for simplified interfaces
Dedicated physical links for latency-critical messages
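The end-to-end ordering idea listed above can be illustrated abstractly: within one stream, responses must be released to the requester in issue order, but distinct streams never block each other. The toy model below is a minimal sketch of that invariant (the class, stream IDs, and per-stream FIFOs are illustrative, not the paper's actual RTL mechanism):

```python
from collections import defaultdict, deque

class MultiStreamReorder:
    """Toy model of per-stream end-to-end ordering:
    responses are released in issue order within a stream,
    while separate streams stay fully independent."""

    def __init__(self):
        self.pending = defaultdict(deque)  # stream_id -> issued txn IDs (FIFO)
        self.arrived = defaultdict(set)    # stream_id -> responses received

    def issue(self, stream_id, txn_id):
        """Record a transaction at issue time, fixing its order in the stream."""
        self.pending[stream_id].append(txn_id)

    def receive(self, stream_id, txn_id):
        """Accept a possibly out-of-order response; return the txn IDs
        that are now releasable in issue order for this stream."""
        self.arrived[stream_id].add(txn_id)
        released = []
        queue = self.pending[stream_id]
        while queue and queue[0] in self.arrived[stream_id]:
            released.append(queue.popleft())
        return released

# Two independent streams: stream B's completion is never held up by stream A.
net = MultiStreamReorder()
net.issue("A", 0); net.issue("A", 1); net.issue("B", 0)
print(net.receive("A", 1))  # [] -> A's txn 0 is still outstanding
print(net.receive("B", 0))  # [0] -> B releases immediately, despite A waiting
print(net.receive("A", 0))  # [0, 1] -> A's txns drain in issue order
```

In a single-stream design, B's response above would sit behind A's stalled transaction; keeping one ordering domain per stream is what eliminates such interstream dependencies.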
Tim Fischer
Integrated System Laboratory (IIS), ETH Zurich, Switzerland
M. Rogenmoser
Integrated System Laboratory (IIS), ETH Zurich, Switzerland
Thomas Emanuel Benz
Integrated System Laboratory (IIS), ETH Zurich, Switzerland
Frank K. Gürkaynak
Senior Scientist, ETH Zurich
Digital VLSI design
Luca Benini
ETH Zürich, Università di Bologna
Integrated Circuits · Computer Architecture · Embedded Systems · VLSI · Machine Learning