🤖 AI Summary
This work addresses a hardware bottleneck in edge diffusion models: the center-dependent nonlinear 5×5 stencil of the locally coupled Kuramoto drift kernel, which maps poorly to existing CNN and matrix accelerators. The paper presents SA-Kura, which the authors describe as, to their knowledge, the first digital systolic-array architecture dedicated to this drift kernel. It reformulates pairwise phase coupling into a neighborhood accumulation that is independent of the center phase, followed by a single center-dependent multiply-subtract operation. This enables regular systolic execution and register-level data reuse while eliminating transcendental function units within processing elements. Implemented as synthesizable RTL and integrated into a lightweight RISC-V SoC, the design is prototyped on an FPGA and evaluated through 45 nm CMOS synthesis and power analysis. For the drift kernel alone, it achieves 193× lower latency and 69.4× lower energy than software execution on the processor core of the same SoC; against a standalone CUDA implementation on Jetson Orin Nano, it is 6.57× faster and uses approximately 46.0× less energy per pixel.
📝 Abstract
Diffusion inference remains costly for edge deployment, yet existing accelerators focus almost exclusively on score networks because standard drift is merely a trivial linear scaling. Kuramoto orientation diffusion replaces this trivial drift with locally coupled phase interactions, improving sampling efficiency but introducing a new hardware bottleneck: a center-dependent nonlinear 5×5 stencil evaluated at every reverse step. This kernel maps poorly to conventional CNN accelerators and matrix-oriented engines. We present SA-Kura, to our knowledge the first digital systolic-array accelerator dedicated to locally coupled Kuramoto drift. By reformulating pairwise sinusoidal coupling into neighbor accumulation independent of the center phase followed by a single center-dependent multiply-subtract combination, SA-Kura eliminates in-PE transcendental units and enables regular systolic execution with register-level reuse. SA-Kura was implemented in synthesizable RTL, integrated into a lightweight RISC-V-based SoC, prototyped on FPGA, and evaluated through 45 nm CMOS synthesis and power analysis. For the drift kernel only, compared with software execution of the same kernel on the processor core in the same SoC platform, SA-Kura reduces latency and energy by 193× and 69.4×, respectively. Compared with a standalone Jetson Orin Nano CUDA implementation of the same kernel, it is 6.57× faster and achieves approximately 46.0× lower energy per pixel.
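The reformulation described in the abstract appears to rest on the identity sin(θ_j − θ_i) = sin θ_j · cos θ_i − cos θ_j · sin θ_i, which lets the neighbor sums be accumulated without reference to the center phase. The following NumPy sketch illustrates that idea and is not the paper's implementation. The uniform coupling weights, the wrap-around boundary handling, and the function names `drift_direct` and `drift_factored` are assumptions made for illustration; the abstract does not specify them, nor how the hardware obtains the per-pixel sine and cosine values.

```python
import numpy as np


def drift_direct(theta, radius=2):
    """Pairwise form: sum of sin(theta_j - theta_i) over a (2r+1)x(2r+1) window.

    Assumes uniform weights and periodic (wrap-around) boundaries.
    One sine evaluation is needed per neighbor per pixel.
    """
    out = np.zeros_like(theta)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            neighbor = np.roll(theta, shift=(dy, dx), axis=(0, 1))
            out += np.sin(neighbor - theta)
    return out


def drift_factored(theta, radius=2):
    """Factored form: center-independent accumulation, then one multiply-subtract.

    sin/cos are evaluated once per pixel up front; the window loop only adds.
    """
    s, c = np.sin(theta), np.cos(theta)
    sum_sin = np.zeros_like(theta)  # S = sum over neighbors of sin(theta_j)
    sum_cos = np.zeros_like(theta)  # C = sum over neighbors of cos(theta_j)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            sum_sin += np.roll(s, shift=(dy, dx), axis=(0, 1))
            sum_cos += np.roll(c, shift=(dy, dx), axis=(0, 1))
    # Single center-dependent combination: cos(theta_i) * S - sin(theta_i) * C
    return c * sum_sin - s * sum_cos


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    phases = rng.uniform(-np.pi, np.pi, size=(16, 16))
    diff = np.abs(drift_direct(phases) - drift_factored(phases)).max()
    print("max abs difference between the two forms:", diff)
```

In the factored form the inner loop contains only additions, which is the property that would make the accumulation regular enough for a systolic array; the center pixel's own contribution cancels in both forms, so including it in the window does not change the result.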