μ-ORCA: Optimizing Acceleration for Microsecond-Scale Deep Neural Network Inference on ACAP

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the inability of existing AMD ACAP frameworks to meet the 1 μs latency budget of small-scale DNN inference tasks, such as jet tagging in high-energy physics. The causes are inefficient on-chip communication, high inter-layer latency, and the absence of accurate hardware overhead modeling. To overcome these limitations, the authors propose μ-ORCA, a customized heterogeneous acceleration framework for ACAPs in which DNN layers communicate directly on the AIE array over a 512-bit/cycle cascade connection, in place of a 32-bit/cycle DMA connection. The framework includes an overhead-aware performance model that accounts for synchronization and VLIW processor prologue, and it supports non-MM operators including bias addition, ReLU, and global aggregation. Evaluated on the VEK280 platform, μ-ORCA achieves an end-to-end inference latency of 0.93 μs for a six-layer DeepSets model, meeting the 1 μs budget, and reduces average latency by more than 1.70× and 1.83× relative to different state-of-the-art ACAP frameworks.
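To make the operator mix concrete, here is a minimal pure-Python sketch of a DeepSets-style forward pass: a per-element MLP, a global aggregation over the set, and a second MLP on the pooled vector. It is only an illustration of the bias, ReLU, and aggregation steps named above, not the μ-ORCA implementation. The function names, the layer layout, and the choice of sum as the aggregation are assumptions.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def dense(x, weights, bias, activate=True):
    """One fully connected layer: matrix-vector product, bias add, optional ReLU.

    `weights` is a list of rows, one row per output feature.
    """
    y = [sum(w * a for w, a in zip(row, x)) + b for row, b in zip(weights, bias)]
    return relu(y) if activate else y


def deepsets_forward(elements, phi_layers, rho_layers):
    """Hypothetical DeepSets forward pass (sum aggregation is an assumption)."""
    # phi: the same MLP is applied independently to every set element.
    encoded = []
    for x in elements:
        for weights, bias in phi_layers:
            x = dense(x, weights, bias)
        encoded.append(x)

    # Global aggregation: sum each feature across all set elements.
    pooled = [sum(column) for column in zip(*encoded)]

    # rho: MLP on the pooled vector; the last layer has no activation.
    last = len(rho_layers) - 1
    for i, (weights, bias) in enumerate(rho_layers):
        pooled = dense(pooled, weights, bias, activate=(i < last))
    return pooled


# Toy weights chosen for readability, not taken from the paper.
phi = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])]
rho = [([[1.0, 1.0]], [0.5])]
elements = [[1.0, -2.0], [3.0, 4.0]]
print(deepsets_forward(elements, phi, rho))
```

Because the aggregation is a sum, reordering `elements` leaves the output unchanged, which is the property that makes this model family suitable for unordered inputs such as the particles in a jet.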

📝 Abstract

Heterogeneous reconfigurable platforms with tensor cores, such as AMD ACAP, are increasingly adopted for deep neural network (DNN) inference due to their high throughput and flexibility. However, their suitability for microsecond-scale inference on small problem sizes remains underexplored. In jet-tagging applications in high-energy physics, inefficient on-chip communication and large inter-layer latency prevent existing frameworks from meeting the 1-μs latency budget. Moreover, hardware overheads such as synchronization and VLIW processor prologue are often overlooked, making it infeasible to optimize accelerators correctly. To address these problems, we propose μ-ORCA, a customized heterogeneous accelerator framework for ultra-low-latency model inference. μ-ORCA enables direct inter-layer communication between DNN layers on the AIE array, instead of using shared memory tiles or FPGA fabric. Moreover, a 512-bit/cycle cascade connection is applied instead of a 32-bit/cycle DMA connection. μ-ORCA also provides an overhead-aware performance model that adapts to different NN layer sizes, and conducts design space exploration to optimize end-to-end latency. μ-ORCA supports MLP and DeepSets models with non-MM kernels, including bias, ReLU, and global aggregation on AIE. We evaluate μ-ORCA on the AMD ACAP VEK280 platform. Experimental results show that μ-ORCA achieves an average latency reduction of >1.70× and >1.83× compared with different state-of-the-art ACAP frameworks, and achieves 0.93 μs latency for a 6-layer real-world DeepSets model, satisfying the latency budget. We open source μ-ORCA at https://github.com/arc-research-lab/u-ORCA.
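The gap between the two link widths quoted in the abstract can be illustrated with simple arithmetic. The sketch below counts the cycles needed to move a payload at a fixed width, ignoring setup and synchronization costs. The helper name and the payload size are hypothetical and are not figures from the paper.

```python
def transfer_cycles(payload_bits: int, bits_per_cycle: int) -> int:
    """Cycles to move a payload at a fixed link width.

    Rounds up to whole cycles and ignores setup and synchronization overhead.
    """
    return (payload_bits + bits_per_cycle - 1) // bits_per_cycle


# Hypothetical payload: 64 activations of 8 bits each (not from the paper).
payload_bits = 64 * 8

cascade_cycles = transfer_cycles(payload_bits, 512)  # cascade width from the abstract
dma_cycles = transfer_cycles(payload_bits, 32)       # DMA width from the abstract

print(cascade_cycles, dma_cycles, dma_cycles // cascade_cycles)  # prints "1 16 16"
```

For payloads that are a multiple of 512 bits, the ratio is a constant 16×, which is simply 512 / 32. Real transfers also pay the overheads that the paper's performance model is meant to capture, so this ratio is an upper bound on the raw-bandwidth benefit rather than a latency prediction.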

Problem

Research questions and friction points this paper is trying to address.

microsecond-scale inference

inter-layer latency

on-chip communication

hardware overhead

ultra-low-latency DNN

Innovation

Methods, ideas, or system contributions that make the work stand out.

microsecond-scale inference

inter-layer communication

cascade connection