Efficient FPGA Implementation of Time-Domain Popcount for Low-Complexity Machine Learning

📅 2025-05-04
🤖 AI Summary
Population count (popcount) and argmax operations are a latency and power bottleneck during inference in low-complexity machine learning models such as Tsetlin Machines. To address this, the paper proposes a time-domain hardware acceleration method. Programmable delay lines (PDLs) and asynchronous arbiters implement a delay race that maps popcount and argmax onto signal propagation time, which fits naturally with asynchronous architectures. Combined with an FPGA design flow that addresses delay skew, the approach yields up to 38% lower inference latency, 43.1% lower dynamic power, and 15% lower resource utilization in an asynchronous Tsetlin Machine, compared with synchronous Tsetlin Machines that use adder-based popcount. The paper presents time-domain co-acceleration of popcount and argmax as a promising route to ultra-low-power edge AI inference.

📝 Abstract
Population count (popcount) is a crucial operation for many low-complexity machine learning (ML) algorithms, including the Tsetlin Machine (TM), a promising new ML method that is particularly well suited to classification tasks. The inference mechanism in a TM consists of propositional logic-based structures within each class, followed by a majority voting scheme that makes the classification decision. In a TM, the voters are the outputs of Boolean clauses. The voting mechanism comprises two operations: a popcount for each class, and an argmax operation that determines the class with the maximum vote. While TMs offer a lightweight ML alternative, their performance is often limited by the high computational cost of the popcount and of the comparisons required to produce the argmax result. In this paper, we propose an innovative approach to accelerate and optimize these operations by performing them in the time domain. Our time-domain implementation uses programmable delay lines (PDLs) and arbiters to perform both operations through delay-based mechanisms. We also present an FPGA design flow for practical implementation of the time-domain popcount, which addresses delay skew and ensures that the hardware behavior matches the model's intended functionality. By leveraging the natural compatibility of the proposed popcount with asynchronous architectures, we demonstrate significant improvements in an asynchronous TM, including up to 38% reduction in latency, 43.1% reduction in dynamic power, and 15% savings in resource utilization, compared to synchronous TMs using adder-based popcount.
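The voting step described in the abstract can be illustrated with a minimal software sketch. This is not the paper's hardware design; it only shows the reference behavior (a popcount per class followed by an argmax) that any implementation has to reproduce. The function names and the example clause outputs are hypothetical, and clause polarity is ignored for simplicity.

```python
from typing import List


def popcount(bits: List[int]) -> int:
    """Count the 1s in a list of Boolean clause outputs."""
    return sum(1 for b in bits if b)


def classify(clause_outputs: List[List[int]]) -> int:
    """Return the index of the class with the most votes.

    clause_outputs[c] holds the clause outputs (0 or 1) of class c.
    Ties go to the lowest class index (an assumption of this sketch).
    """
    votes = [popcount(row) for row in clause_outputs]
    return max(range(len(votes)), key=lambda c: votes[c])


# Hypothetical clause outputs for three classes, four clauses each.
clause_outputs = [
    [1, 0, 1, 0],  # class 0: 2 votes
    [1, 1, 1, 0],  # class 1: 3 votes
    [0, 0, 1, 0],  # class 2: 1 vote
]
predicted = classify(clause_outputs)
```

In an adder-based design, the popcount corresponds to an adder tree per class and the argmax to a chain of comparators, which are the costs the abstract refers to.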
Problem

Research questions and friction points this paper is trying to address.

Optimizing popcount for low-complexity ML algorithms
Reducing computational cost of argmax in Tsetlin Machines
FPGA implementation of time-domain popcount for efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Time-domain popcount using delay lines
FPGA design flow that addresses delay skew
Asynchronous architecture for power reduction
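The idea of a time-domain popcount can be sketched with a small behavioral model. It assumes an ideal delay line with no skew, in which each clause output selects either a fast or a slow delay element, and an ideal arbiter that picks the signal arriving first. The delay values, the choice that a 1 selects the fast path, and all names are assumptions made for illustration; they are not taken from the paper.

```python
from typing import List

# Hypothetical per-stage delays in arbitrary time units (assumed values).
T_FAST = 1.0  # stage delay when the clause output is 1
T_SLOW = 2.0  # stage delay when the clause output is 0


def pdl_delay(bits: List[int], t_fast: float = T_FAST, t_slow: float = T_SLOW) -> float:
    """Total propagation delay of one idealized programmable delay line.

    Each bit configures one stage, so more 1s give a shorter total delay.
    """
    return sum(t_fast if b else t_slow for b in bits)


def arbiter(arrival_times: List[float]) -> int:
    """Return the index of the earliest-arriving signal (lowest index on ties)."""
    return min(range(len(arrival_times)), key=lambda i: arrival_times[i])


def time_domain_argmax(clause_outputs: List[List[int]]) -> int:
    """Race one delay line per class; the first to arrive wins."""
    return arbiter([pdl_delay(row) for row in clause_outputs])


# Hypothetical clause outputs for three classes, four clauses each.
clause_outputs = [
    [1, 0, 1, 0],  # delay 6.0
    [1, 1, 1, 0],  # delay 5.0, arrives first
    [0, 0, 1, 0],  # delay 7.0
]
winner = time_domain_argmax(clause_outputs)
```

With equal-length lines, the total delay is `n * T_SLOW - popcount * (T_SLOW - T_FAST)`, so the earliest arrival corresponds to the largest popcount. On real hardware, unequal stage delays (delay skew) can break this ordering, which is why the listed design flow treats skew explicitly.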
Shengyu Duan
Newcastle University
Marcos L. L. Sartori
Microsystems Research Group, Newcastle University, Newcastle upon Tyne, UK
R. Shafik
Microsystems Research Group, Newcastle University, Newcastle upon Tyne, UK
Alex Yakovlev
Microsystems Research Group, Newcastle University, Newcastle upon Tyne, UK
Emre Ozer
Pragmatic
Processor Microarchitecture, Printed/Flexible Chips, Resource-constrained ML HW