μ-ORCA: Optimizing Acceleration for Microsecond-Scale Deep Neural Network Inference on ACAP

📅 2026-05-17

🤖 AI Summary
This work addresses the inability of existing AMD ACAP frameworks to meet the 1 µs latency budget of small-scale DNN inference tasks, such as jet tagging in high-energy physics. The causes are inefficient on-chip communication, high inter-layer latency, and the absence of accurate hardware overhead modeling. To overcome these limitations, the authors propose μ-ORCA, a customized heterogeneous acceleration framework for ACAPs. It passes data directly between layers on the AIE array and uses a 512-bit/cycle cascade connection in place of a 32-bit/cycle DMA connection. The framework includes a performance model that accounts for overheads such as synchronization and the VLIW processor prologue, and it supports non-MM operators including bias addition, ReLU, and global aggregation. Evaluated on the VEK280 platform, μ-ORCA achieves an end-to-end inference latency of 0.93 µs for a six-layer DeepSets model and reduces average latency by more than 1.70× and 1.83× relative to different state-of-the-art ACAP frameworks, satisfying the 1 µs latency budget.
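To illustrate why an overhead-aware model matters for design space exploration, the following is a minimal sketch and not the paper's model. Every constant, the cost formula, and the function names `layer_cycles` and `explore` are hypothetical. It assumes each layer pays a fixed start-up cost plus a synchronization cost that grows with the number of tiles, so adding tiles eventually stops helping.

```python
import math
from itertools import product

# All constants are illustrative assumptions, not measurements from the paper.
FIXED_OVERHEAD_CYCLES = 40   # hypothetical per-layer cost (e.g. a VLIW prologue)
SYNC_CYCLES_PER_TILE = 6     # hypothetical synchronization cost per extra tile
MACS_PER_CYCLE = 64          # hypothetical per-tile MAC throughput


def layer_cycles(rows, inner, cols, tiles):
    """Estimate cycles for a (rows x inner) @ (inner x cols) layer split over `tiles` tiles."""
    macs = rows * inner * cols
    compute = math.ceil(macs / (MACS_PER_CYCLE * tiles))
    sync = SYNC_CYCLES_PER_TILE * (tiles - 1)
    return FIXED_OVERHEAD_CYCLES + compute + sync


def explore(layers, tile_options, tile_budget):
    """Exhaustively pick a tile count per layer that minimizes total cycles within a tile budget."""
    best = None
    for choice in product(tile_options, repeat=len(layers)):
        if sum(choice) > tile_budget:
            continue
        total = sum(layer_cycles(*shape, t) for shape, t in zip(layers, choice))
        if best is None or total < best[0]:
            best = (total, choice)
    return best


# Hypothetical three-layer MLP: (rows, inner, cols) per layer.
layers = [(8, 16, 32), (8, 32, 32), (8, 32, 8)]
total_cycles, tiles_per_layer = explore(layers, tile_options=(1, 2, 4, 8), tile_budget=12)
print(total_cycles, tiles_per_layer)  # 226 (4, 4, 2)
```

Under these assumed costs, the smallest layer is fastest on 2 tiles, not 8, because the synchronization term outweighs the compute saved. A model that ignored overheads would always pick the largest tile count.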
📝 Abstract
Heterogeneous reconfigurable platforms with tensor cores, such as AMD ACAP, are increasingly adopted for deep neural network (DNN) inference due to their high throughput and flexibility. However, their suitability for microsecond-scale inference on small problem sizes remains underexplored. In jet-tagging applications in high-energy physics, inefficient on-chip communication and large inter-layer latency prevent existing frameworks from meeting the 1-μs latency budget. Moreover, hardware overheads such as synchronization and VLIW processor prologue are often overlooked, making it infeasible to optimize accelerators correctly. To address these problems, we propose μ-ORCA, a customized heterogeneous accelerator framework for ultra-low-latency model inference. μ-ORCA enables direct inter-layer communication between DNN layers on the AIE array, instead of using shared memory tiles or FPGA fabric. In addition, a 512-bit/cycle cascade connection is used instead of a 32-bit/cycle DMA connection. μ-ORCA also provides an overhead-aware performance model that adapts to different NN layer sizes, and conducts design space exploration to optimize end-to-end latency. μ-ORCA supports MLP and DeepSets models with non-MM kernels, including bias, ReLU, and global aggregation on AIE. We evaluate μ-ORCA on the AMD ACAP VEK280 platform. Experimental results show that μ-ORCA achieves average latency reductions of more than 1.70× and 1.83× compared with different state-of-the-art ACAP frameworks, and achieves 0.93 μs latency for a 6-layer real-world DeepSets model, satisfying the latency budget. We open source μ-ORCA at https://github.com/arc-research-lab/u-ORCA.
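The gap between the two link widths quoted in the abstract can be made concrete with a small back-of-the-envelope calculation. The sketch below only divides a tensor's size by the link width. The tensor shape, the int8 element size, and the helper name `transfer_cycles` are assumptions for illustration, and setup latency is ignored.

```python
import math

DMA_BITS_PER_CYCLE = 32       # DMA connection width quoted in the abstract
CASCADE_BITS_PER_CYCLE = 512  # cascade connection width quoted in the abstract


def transfer_cycles(num_elements, bits_per_element, link_bits_per_cycle):
    """Cycles needed to stream a tensor over a link, ignoring any setup latency."""
    total_bits = num_elements * bits_per_element
    return math.ceil(total_bits / link_bits_per_cycle)


# Hypothetical inter-layer activation: 8 x 32 values stored as int8.
elements, bits = 8 * 32, 8
dma = transfer_cycles(elements, bits, DMA_BITS_PER_CYCLE)
cascade = transfer_cycles(elements, bits, CASCADE_BITS_PER_CYCLE)
print(dma, cascade, dma / cascade)  # 64 4 16.0
```

For any transfer that fills whole cycles on both links, the ratio is 512 / 32 = 16. At microsecond scale, tens of cycles saved per layer boundary can be a meaningful share of the total budget.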
Problem

Research questions and friction points this paper is trying to address.

microsecond-scale inference
inter-layer latency
on-chip communication
hardware overhead
ultra-low-latency DNN
Innovation

Methods, ideas, or system contributions that make the work stand out.

microsecond-scale inference
inter-layer communication
cascade connection
overhead-aware performance model
heterogeneous acceleration