da4ml: Distributed Arithmetic for Real-time Neural Networks on FPGAs

📅 2025-07-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying microsecond-latency neural networks on FPGAs faces a severe area bottleneck due to constant matrix–vector multiplication (CMVM) operations. Method: This paper proposes an efficient constant matrix–vector multiplication optimization based on distributed arithmetic (DA), integrating full unrolling with deep pipelining to achieve single-cycle throughput while minimizing hardware resource consumption. The designed DA algorithm jointly optimizes computational efficiency and area utilization. Contribution/Results: Integrated into the hls4ml open-source framework, the method enables end-to-end deployment of highly quantized real-world networks. It reduces on-chip FPGA resource usage by up to 33% and decreases inference latency, thereby enabling deployment of ultra-low-latency models previously infeasible due to resource constraints. This advancement is particularly valuable for latency-critical applications such as high-energy physics experiments.

📝 Abstract
Neural networks with a latency requirement on the order of microseconds, like the ones used at the CERN Large Hadron Collider, are typically deployed on FPGAs fully unrolled and pipelined. A bottleneck for the deployment of such neural networks is area utilization, which is directly related to the required constant matrix-vector multiplication (CMVM) operations. In this work, we propose an efficient algorithm for implementing CMVM operations with distributed arithmetic (DA) on FPGAs that simultaneously optimizes for area consumption and latency. The algorithm achieves resource reduction similar to state-of-the-art algorithms while being significantly faster to compute. The proposed algorithm is open-sourced and integrated into the hls4ml library, a free and open-source library for running real-time neural network inference on FPGAs. We show that the proposed algorithm can reduce on-chip resources by up to a third for realistic, highly quantized neural networks while simultaneously reducing latency, enabling the implementation of previously infeasible networks.
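To make the CMVM idea concrete, the sketch below shows the shift-and-add primitive that distributed-arithmetic CMVM circuits are built from: each constant multiplication is decomposed into adds of shifted inputs, which map directly to FPGA logic instead of DSP multipliers. This is a hypothetical illustration (the function name `cmvm_shift_add` is invented); the paper's actual algorithm goes further by sharing common subexpressions across output rows to cut area.

```python
import numpy as np

def cmvm_shift_add(C, x):
    """Compute y = C @ x for a constant integer matrix C using only
    shifts and adds, the primitives a DA-style CMVM maps to FPGA logic."""
    C = np.asarray(C, dtype=np.int64)
    x = np.asarray(x, dtype=np.int64)
    y = np.zeros(C.shape[0], dtype=np.int64)
    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            c = int(C[i, j])
            sign = 1 if c >= 0 else -1
            c = abs(c)
            shift = 0
            # Decompose |c| into a sum of powers of two:
            # c * x[j] = sum over set bits k of (x[j] << k)
            while c:
                if c & 1:
                    y[i] += sign * (int(x[j]) << shift)
                c >>= 1
                shift += 1
    return y

# Example: [[3, -2], [5, 0]] @ [4, 7] = [-2, 20]
print(cmvm_shift_add([[3, -2], [5, 0]], [4, 7]))
```

Because the matrix is constant at synthesis time, each such decomposition becomes a fixed adder tree; the optimization problem the paper addresses is choosing decompositions that minimize total adders and pipeline depth.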
Problem

Research questions and friction points this paper is trying to address.

FPGA area and latency constraints limit real-time neural network deployment
Constant matrix-vector multiplication dominates on-chip resource usage
Some ultra-low-latency networks are infeasible under existing resource budgets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributed arithmetic optimizes FPGA area and latency
Open-sourced algorithm integrated into hls4ml library
Reduces resources by a third, lowers latency
👥 Authors
Chang Sun (California Institute of Technology, USA)
Zhiqiang Que (Imperial College London, UK)
Vladimir Loncar (CERN)
Wayne Luk (Imperial College London, UK; Professor of Computer Engineering; Hardware and Architecture, Reconfigurable Computing, Design Automation)
Maria Spiropulu (California Institute of Technology, USA)