Scale When Needed: Adaptive Neuron-level Mixed Precision Quantization Aware Training

📅 2026-05-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of deploying deep neural networks on resource-constrained 6G edge devices, where models must be compressed aggressively with minimal accuracy loss. Existing mixed-precision quantization methods operate at coarse layer- or channel-level granularity, so they can miss neuron-level variation in precision requirements. The authors propose Neuron-Level Mixed-Precision Quantization-Aware Training (NMP-QAT), in which each neuron learns its own discrete bit-width during training, starting from low precision and increasing it only when training signals demand it. The scheme applies to both weights and activations. By combining differentiable surrogates and straight-through estimators with a fully discrete inference graph, NMP-QAT achieves better compression-accuracy trade-offs than mixed-precision QAT baselines on MLP and tabular foundation model architectures across telecom and non-telecom datasets, which makes it well-suited to Green AI deployment at the network edge.

📝 Abstract

Deploying deep neural networks on resource-constrained 6G edge devices demands aggressive compression with minimal accuracy loss. Quantization-Aware Training (QAT) has emerged as a leading compression approach; however, existing mixed-precision methods typically operate at coarse layer- or channel-level granularity. These methods often rely on heuristic or search-based bit-allocation strategies, which may overlook fine-grained variability at the neuron level. We propose Neuron-Level Mixed-Precision QAT (NMP-QAT), where each neuron independently learns its own discrete precision during training. Starting from low-bit precision, NMP-QAT expands bit-width only when training signals demand it, via differentiable surrogates and straight-through estimators, while preserving a fully discrete inference graph. This adaptability extends to both weights and activations, reducing memory movement. Evaluated on telecom and non-telecom datasets across MLP and tabular foundation model architectures, NMP-QAT achieves superior compression-accuracy trade-offs over mixed-precision QAT baselines, making it well-suited for Green AI deployments at the network edge.
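To make the idea of neuron-level precision concrete, here is a minimal sketch of what a per-neuron mixed-precision forward pass could look like. It is not the NMP-QAT implementation: the symmetric uniform quantizer, the max-abs scale, and the names `quantize_row`, `quantize_layer`, and `average_bits` are all assumptions chosen for illustration, and the bit-widths are supplied by hand rather than learned. During training, a straight-through estimator would typically treat the rounding step as the identity in the backward pass; that part is not shown.

```python
from typing import List


def quantize_row(weights: List[float], bits: int) -> List[float]:
    """Fake-quantize one neuron's weights with a symmetric uniform grid.

    Hypothetical helper: the max-abs scale is an illustrative choice,
    not something the abstract specifies.
    """
    if bits < 2:
        raise ValueError("this sketch assumes at least 2 bits per neuron")
    qmax = 2 ** (bits - 1) - 1  # largest signed integer level
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return [0.0 for _ in weights]
    scale = max_abs / qmax  # step size between adjacent levels
    quantized = []
    for w in weights:
        q = round(w / scale)            # snap to the nearest integer level
        q = max(-qmax, min(qmax, q))    # clamp to the representable range
        quantized.append(q * scale)     # map back to a real value
    return quantized


def quantize_layer(
    weight_rows: List[List[float]], bits_per_neuron: List[int]
) -> List[List[float]]:
    """Quantize each neuron (one row of weights) at its own bit-width."""
    if len(weight_rows) != len(bits_per_neuron):
        raise ValueError("need exactly one bit-width per neuron")
    return [quantize_row(row, b) for row, b in zip(weight_rows, bits_per_neuron)]


def average_bits(weight_rows: List[List[float]], bits_per_neuron: List[int]) -> float:
    """Average storage cost in bits per weight across the layer."""
    total_bits = sum(len(row) * b for row, b in zip(weight_rows, bits_per_neuron))
    total_weights = sum(len(row) for row in weight_rows)
    return total_bits / total_weights


# Two neurons: the first stays at 2 bits, the second is given 8 bits.
layer = [[0.5, -0.3, 0.1], [1.0, -0.75, 0.33]]
bits = [2, 8]
quantized_layer = quantize_layer(layer, bits)
mean_bits = average_bits(layer, bits)
```

In this toy layer the first neuron is stored at 2 bits and the second at 8, so the average cost is 5 bits per weight instead of 8 for a uniform 8-bit layer. A layer- or channel-level scheme would have to pick one width for the whole group of weights.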

Problem

Research questions and friction points this paper is trying to address.

mixed-precision quantization

quantization-aware training

neuron-level granularity

edge AI

model compression

Innovation

Methods, ideas, or system contributions that make the work stand out.

Neuron-level quantization

Mixed-precision QAT

Differentiable surrogate