🤖 AI Summary
To address the performance and energy-efficiency degradation that data-movement bottlenecks cause in DNN inference, this paper proposes DataMaestro, a dataflow engine built around a decoupled access/execute streaming architecture for dataflow accelerators. Key innovations include the decoupling of computation from memory access, programmable address generation for flexible access patterns, fine-grained dynamic prefetching, bank-conflict-aware addressing-mode switching, and on-the-fly on-chip data manipulation that reduces memory footprint and access counts. Evaluated in a RISC-V host system through FPGA prototyping and VLSI synthesis, DataMaestro enables a Tensor Core-like GeMM accelerator to reach nearly 100% utilization, 1.05×–21.39× better than state-of-the-art designs, while the data streaming units occupy only 6.43% of total system area and consume just 15.06% of total system energy.
📝 Abstract
Deep Neural Networks (DNNs) have achieved remarkable success across various intelligent tasks but encounter performance and energy challenges during inference due to data movement bottlenecks. We introduce DataMaestro, a versatile and efficient data streaming unit that brings the decoupled access/execute architecture to DNN dataflow accelerators to address this issue. DataMaestro supports flexible, programmable access patterns to accommodate diverse workload types and dataflows, incorporates fine-grained prefetching and addressing-mode switching to mitigate bank conflicts, and enables customizable on-the-fly data manipulation to reduce memory footprints and access counts. We integrate five DataMaestros with a Tensor Core-like GeMM accelerator and a quantization accelerator into a RISC-V host system for evaluation. FPGA prototyping and VLSI synthesis results demonstrate that DataMaestro helps the GeMM core achieve nearly 100% utilization, 1.05–21.39× better than state-of-the-art solutions, while accounting for merely 6.43% of total system area and 15.06% of total system energy.
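To make the "programmable access patterns" and decoupled access/execute ideas concrete, here is a minimal sketch of an affine (nested-loop, strided) address generator feeding a compute consumer through a FIFO. The function names, the `bounds`/`strides` parameterization, and the specific tile shape are illustrative assumptions, not DataMaestro's actual configuration interface.

```python
from collections import deque

def affine_address_stream(base, bounds, strides):
    """Illustrative programmable address generator: a nest of affine
    loops, each with its own bound and stride (outermost first)."""
    idx = [0] * len(bounds)
    while True:
        yield base + sum(i * s for i, s in zip(idx, strides))
        # Advance indices like an odometer, innermost dimension first.
        for d in reversed(range(len(bounds))):
            idx[d] += 1
            if idx[d] < bounds[d]:
                break
            idx[d] = 0
        else:
            return  # all loops wrapped around: stream is finished

# Decoupled access/execute: the "access" side resolves addresses and
# prefetches operands into a FIFO; the "execute" side only consumes
# data and never issues addresses itself.
memory = list(range(100))  # stand-in for an on-chip SRAM
fifo = deque(memory[a] for a in
             affine_address_stream(base=0, bounds=(2, 3), strides=(10, 2)))
tile = [fifo.popleft() for _ in range(6)]  # 2x3 tile, strides 10 and 2
print(tile)  # [0, 2, 4, 10, 12, 14]
```

The same generator, reconfigured with different bounds and strides, can stream rows, columns, or blocked tiles of a matrix, which is why a single programmable streamer can serve multiple dataflows.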