HGQ-LUT: Fast LUT-Aware Training and Efficient Architectures for DNN Inference

📅 2026-04-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of existing LUT-aware training (LAT) methods: slow training, reliance on manual tuning, and the lack of an end-to-end workflow. The authors propose HGQ-LUT, which introduces LUT-Dense and LUT-Conv layers that use regular, accelerator-efficient tensor operations during training and are compiled into FPGA logic LUTs for inference. By combining fine-grained, element-wise heterogeneous quantization (including zero-bit pruning) with a LUT-aware resource surrogate, HGQ-LUT explores accuracy-resource trade-offs automatically, without manual bit-width tuning. The method accelerates LUT-aware training by over 100× on modern GPUs while achieving state-of-the-art hardware efficiency, and its integration into open-source toolchains supports design, compilation, and bit-exact verification of hybrid architectures that mix LUT-based and conventional arithmetic blocks. The authors present these features as making LAT-based DNNs practical for real-world deployment, such as in experiments at the CERN Large Hadron Collider.
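To make the idea of compiling a computation into a lookup table concrete, here is a minimal sketch in plain Python. It is not the HGQ-LUT compiler and does not use the HGQ2 API. It only illustrates the general principle that a function of a few quantized input bits can be enumerated once and then evaluated by indexing. The names `build_lut` and `lut_eval`, the input mapping to [0, 1), and the output scaling are all assumptions chosen for illustration.

```python
def build_lut(fn, in_bits, out_bits):
    """Enumerate every in_bits-wide input code and store a quantized output.

    Hypothetical helper: inputs are mapped to [0, 1) and outputs are
    scaled to unsigned out_bits-wide integers, then clamped.
    """
    n_inputs = 1 << in_bits
    max_out = (1 << out_bits) - 1
    table = []
    for code in range(n_inputs):
        x = code / n_inputs                # input code -> real value in [0, 1)
        q = round(fn(x) * max_out)         # real output -> integer code
        table.append(max(0, min(max_out, q)))
    return table


def lut_eval(table, code):
    """Inference is a single table lookup, with no arithmetic at run time."""
    return table[code]


# Example: a 4-bit-in, 4-bit-out table for f(x) = x^2.
square_lut = build_lut(lambda x: x * x, in_bits=4, out_bits=4)
```

The table size grows as 2^in_bits, which is why approaches of this kind keep the number of input bits per table small. In this sketch that bound is simply the `in_bits` argument.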

📝 Abstract

Lookup-table (LUT) based neural networks can deliver ultra-low latency and excellent hardware efficiency on FPGAs by mapping arithmetic operations directly onto the logic primitives. However, state-of-the-art LUT-aware training (LAT) approaches remain difficult to use in practice: they are often orders of magnitude slower to train than conventional networks, require non-trivial manual tuning for hardware efficiency, and lack an end-to-end workflow. This work presents HGQ-LUT, integrated in https://github.com/calad0i/HGQ2, a new LAT approach that achieves state-of-the-art hardware efficiency while accelerating training by over 100 times on modern GPUs. HGQ-LUT introduces LUT-Dense and LUT-Conv layers that are implemented with regular, accelerator-efficient tensor operations during training, which are then compiled into logic LUTs for hardware. By combining these layers with fine-grained, element-wise heterogeneous quantization (including zero-bit pruning) and a LUT-aware resource surrogate, HGQ-LUT enables the automatic exploration of accuracy-resource trade-offs without manual bit-width tuning. We further integrate HGQ-LUT into open-source toolchains, enabling unified design, compilation, and bit-exact verification of hybrid architectures that mix LUT-based with conventional arithmetic blocks. These features make LAT-based DNNs practical for real-world deployment, such as at the CERN Large Hadron Collider's experiments.
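The element-wise heterogeneous quantization with zero-bit pruning described above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation. The signed fixed-point format over [-1, 1), the function names `quantize` and `bit_budget`, and the use of a plain sum of bit-widths as a resource estimate are all hypothetical. The paper's LUT-aware resource surrogate is not specified in this abstract and is likely more refined.

```python
def quantize(w, bits):
    """Quantize w to a signed fixed-point value in [-1, 1) using `bits` bits.

    A bit-width of 0 means the element is pruned and contributes nothing.
    """
    if bits == 0:
        return 0.0
    step = 2.0 ** (1 - bits)               # spacing between representable values
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    code = max(lo, min(hi, round(w / step)))
    return code * step


def bit_budget(bit_widths):
    """Crude stand-in for a resource surrogate: total bits kept (assumption)."""
    return sum(bit_widths)


weights = [0.3, -0.7, 0.05, 0.9]
bit_widths = [4, 2, 0, 3]                  # one bit-width per weight; 0 = pruned

quantized = [quantize(w, b) for w, b in zip(weights, bit_widths)]
budget = bit_budget(bit_widths)
```

In a training setup of this kind, the per-element bit-widths would be learned alongside the weights, with the resource estimate added to the loss so that the optimizer trades accuracy against hardware cost. Here they are fixed by hand to keep the example short.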

Problem

Research questions and friction points this paper is trying to address.

LUT-aware training

DNN inference

hardware efficiency

FPGA

training speed

Innovation

Methods, ideas, or system contributions that make the work stand out.

LUT-aware training

heterogeneous quantization

FPGA acceleration