🤖 AI Summary
This work addresses the inefficiency of general-purpose accelerators when handling diverse deep neural network (DNN) workloads by proposing DORA, an instruction-based overlay architecture. DORA explicitly encodes dataflow through a custom instruction set, enabling fine-grained control over data movement, computation, and synchronization, and pairs this with dedicated mechanisms for managing on-chip memory and computational parallelism. A companion compiler framework performs a two-stage design space exploration and offers both a mixed-integer linear programming (MILP) search engine and a heuristic one to balance flexibility and performance. Experimental results on the AMD Versal VCK190 platform show that DORA exhibits less than 5% efficiency variation on a single vector processor across workloads, delivers up to 5× throughput improvement over state-of-the-art accelerators, and attains up to 90% optimality with its heuristic scheduler under practical time constraints.
📝 Abstract
As deep neural networks (DNNs) grow more diverse and complex, achieving high performance and efficiency on them becomes a pressing challenge. Modern DNN workloads vary widely in operation types, tensor shapes, and execution dependencies, making it difficult to sustain high hardware efficiency across models. In addition, a generic accelerator often incurs substantial overhead when executing such diverse workloads.
To address these problems, we propose DORA, an instruction-based overlay architecture that explicitly describes dataflow through a dedicated ISA, enabling fine-grained control of data movement, computation, and synchronization at the layer level. To remain flexible while achieving high performance, DORA adopts novel mechanisms for managing on-chip memory and computation parallelism. DORA also provides a compilation framework that generates instructions for a given DNN workload after a two-stage design space exploration. The framework incorporates both a MILP-based and a heuristic-based search engine, so that schedules can be generated for different needs and constraints.
We prototype DORA on the AMD Versal VCK190 platform, demonstrating its deployability on existing reconfigurable systems. Experimental results show that DORA maintains stable efficiency, with less than 5% variation on a single vector processor across workloads exhibiting up to 6× variation in operation counts. Compared to state-of-the-art accelerators, DORA consistently achieves higher performance, delivering up to 5× throughput improvement. The heuristic-based scheduler further achieves up to 90% optimality under practical time constraints. DORA is open-sourced at https://github.com/arc-research-lab/DORA.git.