🤖 AI Summary
To address the demand for high-throughput, energy-efficient on-chip interconnects in AI accelerators, this paper presents FlooNoC, an open-source network-on-chip (NoC) architecture optimized for high-bandwidth bulk data transfer. Departing from conventional cache-coherent systems that prioritize fine-grained, low-latency cache-line traffic, FlooNoC introduces an end-to-end ordering scheme for AXI4, enabled by a multistream-capable direct memory access (DMA) engine, which simplifies the network interfaces and eliminates interstream dependencies. It further supports nonblocking transactions at the transport level for latency tolerance, dedicates separate physical links to short, latency-critical messages, and uses very wide AXI4-compliant links (645 Gb/s per link). Implemented in 12 nm FinFET technology, FlooNoC achieves a total aggregate bandwidth of 103 Tb/s, an energy efficiency of 0.15 pJ/B/hop at 0.8 V, and an area overhead of only 3.5% per compute tile. Compared with state-of-the-art NoCs, it delivers more than 2× the link bandwidth and 3× the energy efficiency.
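The headline figures can be sanity-checked with simple arithmetic. The sketch below takes the numbers quoted above at face value; the derivation itself (implied link count, example transfer cost) is illustrative and not taken from the paper:

```python
# Back-of-envelope check of FlooNoC's headline figures. The constants come
# from the abstract; the derivations below are illustrative assumptions.
link_bw_gbps = 645             # per-link bandwidth, Gb/s
total_bw_tbps = 103            # aggregate bandwidth, Tb/s
energy_pj_per_byte_hop = 0.15  # energy efficiency at 0.8 V, pJ/B/hop

# Number of links that would have to run at full rate to reach the
# quoted aggregate bandwidth (an implied figure, not stated in the paper):
links = total_bw_tbps * 1e12 / (link_bw_gbps * 1e9)
print(f"~{links:.0f} saturated links")

# Example: energy to move 1 GiB of bulk data across 4 hops
# at the quoted per-byte, per-hop efficiency:
energy_mj = energy_pj_per_byte_hop * 1e-12 * (1 << 30) * 4 * 1e3
print(f"{energy_mj:.2f} mJ")   # -> 0.64 mJ
```

At 0.15 pJ/B/hop, even gigabyte-scale transfers cost well under a millijoule, which is what makes the bulk-transfer-oriented design attractive for AI accelerators.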
📝 Abstract
The new generation of domain-specific AI accelerators is characterized by rapidly increasing demands for bulk data transfers, as opposed to the small, latency-critical cache-line transfers typical of traditional cache-coherent systems. In this article, we address this critical need by introducing the FlooNoC network-on-chip (NoC), featuring very wide, fully Advanced eXtensible Interface (AXI4)-compliant links designed to meet these massive bandwidth needs at high energy efficiency. At the transport level, nonblocking transactions are supported for latency tolerance. In addition, a novel end-to-end ordering approach for AXI4, enabled by a multistream-capable direct memory access (DMA) engine, simplifies the network interfaces (NIs) and eliminates interstream dependencies. Furthermore, dedicated physical links are instantiated for short, latency-critical messages. A complete end-to-end reference implementation in 12-nm FinFET technology demonstrates the physical feasibility and power-performance-area (PPA) benefits of our approach. Using wide links on high metal layers, we achieve a bandwidth of 645 Gb/s/link and a total aggregate bandwidth of 103 Tb/s for an $8 \times 4$ mesh of processor cluster tiles with a total of 288 RISC-V cores. The NoC imposes a minimal area overhead of only 3.5% per compute tile and achieves a leading-edge energy efficiency of 0.15 pJ/B/hop at 0.8 V. Compared with state-of-the-art (SoA) NoCs, our system offers three times the energy efficiency and more than double the link bandwidth. Furthermore, compared with a traditional AXI4-based multilayer interconnect, our NoC achieves a 30% reduction in area, corresponding to a 47% increase in double-precision GFLOPS within the same floorplan.
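The end-to-end ordering idea can be illustrated with a minimal conceptual model (a sketch, not the paper's hardware): each DMA stream retires its transactions in issue order, while completions in different streams never wait on each other. The class and method names below are hypothetical:

```python
from collections import defaultdict, deque

class MultiStreamOrderer:
    """Toy model of per-stream in-order completion with no ordering
    dependency between streams, the property the multistream DMA
    engine's end-to-end ordering scheme provides (illustrative only)."""

    def __init__(self):
        # stream id -> outstanding transaction ids, oldest first
        self.pending = defaultdict(deque)

    def issue(self, stream, txn):
        self.pending[stream].append(txn)

    def complete(self, stream, txn):
        """A response may retire only if it is the oldest outstanding
        transaction of *its own* stream; other streams are never consulted."""
        q = self.pending[stream]
        if q and q[0] == txn:
            q.popleft()
            return True
        return False  # out of order within its stream: must wait

o = MultiStreamOrderer()
o.issue("A", 1); o.issue("A", 2); o.issue("B", 7)
assert o.complete("B", 7)        # stream B retires regardless of stream A
assert not o.complete("A", 2)    # txn 2 must wait for txn 1
assert o.complete("A", 1) and o.complete("A", 2)
```

Because ordering is enforced end to end, per stream, the routers and network interfaces in between need not track AXI4 ordering state, which is what simplifies the NIs and removes interstream head-of-line blocking.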