DataMaestro: A Versatile and Efficient Data Streaming Engine Bringing Decoupled Memory Access To Dataflow Accelerators

📅 2025-04-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the performance and energy-efficiency degradation that data movement bottlenecks cause in DNN inference, this paper proposes DataMaestro, a versatile data streaming unit that brings a decoupled access/execute architecture to dataflow accelerators. Key innovations include strict separation of computation from memory access, programmable address generation for flexible access patterns, fine-grained dynamic prefetching, bank-conflict-aware addressing-mode switching, and on-the-fly on-chip data transformation to reduce memory footprints and access counts. Evaluated in a RISC-V host system via FPGA prototyping and VLSI synthesis, DataMaestro helps a Tensor Core-like GeMM accelerator reach nearly 100% utilization, 1.05x-21.39x better than state-of-the-art solutions, while the streaming units occupy only 6.43% of total system area and consume just 15.06% of system power.

📝 Abstract
Deep Neural Networks (DNNs) have achieved remarkable success across various intelligent tasks but encounter performance and energy challenges in inference execution due to data movement bottlenecks. We introduce DataMaestro, a versatile and efficient data streaming unit that brings the decoupled access/execute architecture to DNN dataflow accelerators to address this issue. DataMaestro supports flexible and programmable access patterns to accommodate diverse workload types and dataflows, incorporates fine-grained prefetch and addressing mode switching to mitigate bank conflicts, and enables customizable on-the-fly data manipulation to reduce memory footprints and access counts. We integrate five DataMaestros with a Tensor Core-like GeMM accelerator and a Quantization accelerator into a RISC-V host system for evaluation. The FPGA prototype and VLSI synthesis results demonstrate that DataMaestro helps the GeMM core achieve nearly 100% utilization, which is 1.05-21.39x better than state-of-the-art solutions, while minimizing area and energy consumption to merely 6.43% and 15.06% of the total system.
Problem

Research questions and friction points this paper is trying to address.

Addresses data movement bottlenecks in DNN inference execution
Enables flexible access patterns for diverse workload types
Reduces memory footprints and access counts via data manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled access/execute architecture for DNN accelerators
Flexible programmable access patterns for diverse workloads
Fine-grained prefetch and addressing mode switching
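The "flexible programmable access patterns" above typically mean a streamer that walks memory as a nested affine loop (base address plus per-dimension strides and bounds) instead of issuing one address per compute instruction. A minimal sketch of that idea follows; the parameter names (`base`, `strides`, `bounds`) are illustrative, not DataMaestro's actual configuration interface, which the paper does not spell out here.

```python
from itertools import product

def gen_addresses(base, strides, bounds):
    """Yield addresses of a nested affine loop nest:
    addr = base + sum(i_d * stride_d), with each index i_d in range(bound_d).
    This mimics how a decoupled streamer can generate a whole access
    pattern from a few programmed parameters."""
    for idx in product(*(range(b) for b in bounds)):
        yield base + sum(i * s for i, s in zip(idx, strides))

# Example: stream a 2x3 tile out of a row-major matrix with 8 elements
# per row, starting at (hypothetical) base address 0x100.
tile = list(gen_addresses(0x100, strides=(8, 1), bounds=(2, 3)))
# tile == [0x100, 0x101, 0x102, 0x108, 0x109, 0x10A]
```

Because the address sequence is fully determined by these parameters, the streamer can run ahead of the compute core and prefetch, which is what lets the decoupled design hide memory latency.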