🤖 AI Summary
Conventional descriptor-based DMACs suffer from high startup latency and low bus utilization when handling small, irregular data transfers. To address this, the paper proposes a lightweight descriptor-based DMA controller for Linux-capable RISC-V systems. The AXI4-based design combines a streamlined descriptor format with speculative descriptor prefetching that adds no latency penalty on a misprediction. Compared to an off-the-shelf descriptor-based DMAC IP, evaluated on a Kintex FPGA, it reduces transfer launch latency by 1.66× and improves bus utilization for 64-byte transfers by up to 2.5× in an ideal memory system and up to 3.6× in deep memory systems. It also uses 11% fewer LUTs, 23% fewer FFs, and no BRAMs. Synthesized in GlobalFoundries' GF12LP+ node, the DMAC runs at over 1.44 GHz and occupies 49.5 kGE.
📝 Abstract
With the ever-growing heterogeneity in computing systems, driven by modern machine learning applications, pressure is increasing on memory systems to handle arbitrary and more demanding transfers efficiently. Descriptor-based direct memory access controllers (DMACs) allow such transfers to be executed by decoupling memory transfers from processing units. Classical descriptor-based DMACs, however, are inefficient when handling arbitrary transfers of small unit sizes: their large descriptors and serialized descriptor processing lead to large static overheads when setting up transfers. To tackle this inefficiency, we propose a descriptor-based DMAC optimized for arbitrary transfers of small unit sizes. We implement a lightweight descriptor format in an AXI4-based DMAC. We further increase performance with a low-overhead speculative descriptor prefetching scheme that incurs no additional latency penalty on a misprediction. Our DMAC is integrated into a 64-bit Linux-capable RISC-V SoC and emulated on a Kintex FPGA to evaluate its performance. Compared to an off-the-shelf descriptor-based DMAC IP, we reduce the latency of launching transfers by 1.66x and increase bus utilization by up to 2.5x in an ideal memory system with 64-byte-length transfers, while requiring 11% fewer lookup tables, 23% fewer flip-flops, and no block RAMs. In deep memory systems, we extend our lead in bus utilization to 3.6x with 64-byte-length transfers. We synthesized our DMAC in GlobalFoundries' GF12LP+ node, achieving a clock frequency of over 1.44 GHz while occupying only 49.5 kGE.
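The idea behind speculative descriptor prefetching can be illustrated with a small behavioral model. The sketch below is not the authors' hardware; it assumes a linked-list descriptor chain, a guess that the next descriptor sits directly after the current one in memory, and hypothetical names and values (`Descriptor`, `DESC_SIZE`, `END_OF_CHAIN`, `walk_chain`) chosen purely for illustration. A correct guess lets the walker reuse the prefetched descriptor, while a wrong guess falls back to the ordinary fetch that a non-speculative walker would have issued anyway.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

DESC_SIZE = 32     # assumed descriptor size in bytes (hypothetical value)
END_OF_CHAIN = 0   # assumed sentinel marking the last descriptor


@dataclass(frozen=True)
class Descriptor:
    src: int        # source address of the transfer
    dst: int        # destination address of the transfer
    length: int     # transfer length in bytes
    next_addr: int  # address of the next descriptor, or END_OF_CHAIN


def walk_chain(memory: Dict[int, Descriptor],
               head: int) -> Tuple[List[Descriptor], int, int]:
    """Walk a descriptor chain, guessing that descriptors are contiguous.

    Returns the descriptors in execution order plus hit and miss counts.
    """
    order: List[Descriptor] = []
    hits = misses = 0
    addr = head
    desc = memory[addr]  # the first fetch is never speculative
    while True:
        order.append(desc)
        # Speculative fetch, issued before next_addr has been inspected.
        guess = addr + DESC_SIZE
        speculative = memory.get(guess)
        addr = desc.next_addr
        if addr == END_OF_CHAIN:
            break
        if addr == guess and speculative is not None:
            desc = speculative   # hit: reuse the prefetched descriptor
            hits += 1
        else:
            desc = memory[addr]  # miss: discard the guess, fetch normally
            misses += 1
    return order, hits, misses


# Example chain: two contiguous descriptors, then a jump to 0x2000.
chain = {
    0x1000: Descriptor(0xA000, 0xB000, 64, 0x1020),
    0x1020: Descriptor(0xA040, 0xB040, 64, 0x2000),
    0x2000: Descriptor(0xA080, 0xB080, 64, END_OF_CHAIN),
}
order, hits, misses = walk_chain(chain, 0x1000)
print(len(order), hits, misses)  # prints "3 1 1"
```

In this model the speculative read is extra work rather than a replacement for the normal fetch, which is one plausible reading of how a misprediction can avoid adding latency; the actual mechanism is described in the paper itself.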