Dato: A Task-Based Programming Model for Dataflow Accelerators

📅 2025-09-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deep learning workloads are increasingly memory-bandwidth-bound, with compute cores frequently stalling due to data movement. Existing dataflow accelerator programming models struggle to balance usability and control: low-level interfaces incur high development overhead, while high-level abstractions hide communication and tiling details, impeding performance optimization. This paper introduces Dato, a task-based, Python-embedded programming model for dataflow accelerators. Dato pioneers first-class support for stream and layout types, explicitly modeling data communication and tiling strategies, and enabling automatic compilation from task-graph specifications to virtual–physical mappings. This design unifies high-level abstraction with low-level optimization capability. Evaluated on an AMD Ryzen AI NPU, Dato achieves 84% hardware utilization for GEMM and delivers a 2.81× speedup over commercial frameworks for attention operators. On a custom systolic array implemented on Xilinx Alveo FPGAs, it attains 98% of theoretical peak performance.

📝 Abstract
Recent deep learning workloads increasingly push computational demand beyond what current memory systems can sustain, with many kernels stalling on data movement rather than computation. While modern dataflow accelerators incorporate on-chip streaming to mitigate off-chip bandwidth limitations, existing programming models struggle to harness these capabilities effectively. Low-level interfaces provide fine-grained control but impose significant development overhead, whereas high-level tile-based languages abstract away communication details, restricting optimization and forcing compilers to reconstruct the intended dataflow. We present Dato, a Python-embedded, task-based programming model for dataflow accelerators that elevates data communication and sharding to first-class type constructs. Developers write programs as a graph of tasks connected via explicit stream types, with sharded inputs specified using layout types. These tasks are first mapped virtually onto the accelerator's spatial fabric, and the compiler then generates a physical mapping that respects hardware constraints. Experimental results on both AMD Ryzen AI NPU and Alveo FPGA devices demonstrate that Dato achieves high performance while significantly reducing the burden of writing optimized code. On the NPU, Dato attains up to 84% hardware utilization for GEMM and delivers a 2.81x speedup on attention kernels compared to a state-of-the-art commercial framework. On the FPGA, Dato surpasses leading frameworks in performance when generating custom systolic arrays, achieving 98% of the theoretical peak performance.
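The abstract's core idea, writing a program as a graph of tasks connected by explicit stream types, can be illustrated with a minimal sketch in plain Python. This is not Dato's actual API (which is not shown in this summary); it models streams as bounded FIFO channels and tasks as threads, purely to make the dataflow style concrete. All names (`make_stream`, `scale_task`, etc.) are hypothetical.

```python
# Hypothetical sketch of the task-graph-with-streams style described in
# the abstract, using plain Python threads and bounded queues. Names are
# illustrative, not Dato's real API.
import queue
import threading

def make_stream(capacity=4):
    """A bounded FIFO standing in for an on-chip stream channel."""
    return queue.Queue(maxsize=capacity)

def producer(out_stream, tiles):
    for tile in tiles:
        out_stream.put(tile)   # blocks when the channel is full (backpressure)
    out_stream.put(None)       # end-of-stream marker

def scale_task(in_stream, out_stream, factor):
    """A compute task: consumes tiles from one stream, emits to another."""
    while True:
        tile = in_stream.get()
        if tile is None:
            out_stream.put(None)
            return
        out_stream.put([x * factor for x in tile])

def consumer(in_stream, results):
    while True:
        tile = in_stream.get()
        if tile is None:
            return
        results.append(tile)

# Build the task graph: producer -> scale -> consumer, wired by streams.
s1, s2 = make_stream(), make_stream()
results = []
tiles = [[1, 2], [3, 4]]
threads = [
    threading.Thread(target=producer, args=(s1, tiles)),
    threading.Thread(target=scale_task, args=(s1, s2, 10)),
    threading.Thread(target=consumer, args=(s2, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [[10, 20], [30, 40]]
```

The bounded channels give the backpressure behavior that hardware streams provide implicitly; making them explicit in the type system is what the paper argues lets the compiler preserve, rather than reconstruct, the intended dataflow.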
Problem

Research questions and friction points this paper is trying to address.

Addresses data movement bottlenecks in deep learning workloads
Overcomes limitations of existing dataflow accelerator programming models
Reduces development overhead while enabling fine-grained optimization control
Innovation

Methods, ideas, or system contributions that make the work stand out.

Task-based programming model with explicit streams
First-class data communication and sharding types
Compiler-generated virtual-to-physical mapping onto the accelerator's spatial fabric
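The virtual-to-physical mapping step can be sketched with a toy placement routine: a virtual task pipeline is placed onto a physical 2D grid of compute tiles so that connected tasks land on adjacent tiles. This is an illustrative stand-in for the idea, not Dato's actual compiler algorithm; `map_pipeline` and its serpentine row-major policy are assumptions for the example.

```python
# Illustrative (hypothetical) placement of a virtual task pipeline onto a
# physical rows x cols grid of tiles. Serpentine row-major order keeps
# each producer/consumer pair on adjacent tiles, so every stream crosses
# at most one hop.
def map_pipeline(tasks, rows, cols):
    if len(tasks) > rows * cols:
        raise ValueError("not enough physical tiles for this pipeline")
    placement = {}
    for i, task in enumerate(tasks):
        r, c = i // cols, i % cols
        if r % 2 == 1:          # reverse odd rows to stay adjacent at row ends
            c = cols - 1 - c
        placement[task] = (r, c)
    return placement

placement = map_pipeline(["load", "matmul", "act", "store"], rows=2, cols=3)
print(placement)
# {'load': (0, 0), 'matmul': (0, 1), 'act': (0, 2), 'store': (1, 2)}
```

Note how `store` wraps to (1, 2), directly below `act` at (0, 2): the serpentine order is what keeps the wrap-around link a single hop, a constraint a real spatial mapper would enforce against the hardware's routing fabric.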