Evolving Layer-Specific Scalar Functions for Hardware-Aware Transformer Adaptation

📅 2026-05-13

🤖 AI Summary
This work addresses the difficulty of deploying Vision Transformers on edge devices, which stems largely from the computational overhead and global reduction bottleneck of Layer Normalization. The authors propose a hardware-aware method that replaces Layer Normalization with heterogeneous scalar functions evolved per layer via genetic programming from pre-trained weights, coupled with a post-training realignment strategy that recovers model performance without retraining from scratch. This yields what is presented as the first layer-wise customized approximation of normalization operations. The evolved scalar functions capture 91.6% of the variance (R²) of the original normalization behavior, compared with 70.2% for homogeneous baselines, and the modified architecture recovers a Top-1 accuracy of 84.25% on ImageNet-1K after only 20 epochs. Removing the global reduction gives a favorable trade-off between arithmetic complexity and off-chip memory traffic.
📝 Abstract
Vision Transformers (ViTs) achieve state-of-the-art performance on challenging vision tasks, but their deployment on edge devices is severely hindered by the computational complexity and global reduction bottleneck imposed by layer normalization. Recent methods attempt to bypass this by replacing normalization layers with hardware-friendly scalar approximations. However, these homogeneous replacements do not fit the behaviour of every layer equally well, and they rely on expensive model retraining. In this work, we propose a highly efficient, hardware-aware framework that uses genetic programming (GP) to evolve heterogeneous, layer-specific scalar functions directly from pre-trained weights. Coupled with a novel post-training re-alignment strategy, our approach entirely eliminates the need to retrain models from scratch. Our evolved expressions accurately approximate the target normalization behaviours, capturing $91.6\%$ of the variance ($R^2$) compared to only $70.2\%$ for homogeneous baselines, allowing our modified architecture to recover $84.25\%$ Top-1 ImageNet-1K accuracy in only 20 epochs. By preserving this performance while eliminating the global reduction bottleneck, our approach establishes a highly favourable trade-off between arithmetic complexity and off-chip memory traffic, removing a primary barrier to the efficient deployment of ViTs on edge accelerators.
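To make the abstract's two central ideas concrete, the following is a minimal sketch, not the authors' implementation. It contrasts layer normalization, which needs a mean and a variance over the whole token vector (the global reduction), with an element-wise scalar stand-in, and scores the stand-in with $R^2$. The stand-in `tanh(alpha * x)`, the grid search over `alpha`, and all function names are assumptions chosen for illustration; the paper evolves the expression itself per layer with genetic programming.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Layer normalization over one token vector (affine parameters omitted)."""
    mean = sum(x) / len(x)                           # reduction over all features
    var = sum((v - mean) ** 2 for v in x) / len(x)   # second reduction
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def scalar_stand_in(x, alpha):
    """Hypothetical element-wise replacement: each output uses only its own input."""
    return [math.tanh(alpha * v) for v in x]


def r_squared(target, pred):
    """Coefficient of determination: fraction of target variance explained by pred."""
    mean_t = sum(target) / len(target)
    ss_res = sum((t - p) ** 2 for t, p in zip(target, pred))
    ss_tot = sum((t - mean_t) ** 2 for t in target)
    return 1.0 - ss_res / ss_tot


random.seed(0)
x = [random.gauss(0.0, 2.0) for _ in range(768)]  # synthetic activations (assumed)
target = layer_norm(x)

# Placeholder for fitting: pick the alpha whose output best matches layer norm.
candidates = [0.05 * k for k in range(1, 41)]
scores = {a: r_squared(target, scalar_stand_in(x, a)) for a in candidates}
best_alpha = max(scores, key=scores.get)
best_r2 = scores[best_alpha]

print(f"best alpha = {best_alpha:.2f}, R^2 = {best_r2:.3f}")
```

Because `scalar_stand_in` never looks at other features, it can be applied as data streams through an accelerator, whereas `layer_norm` must see the full vector before emitting any output. A single fixed form such as this one corresponds to the homogeneous case the abstract criticizes; the layer-specific approach would choose a different expression for each layer.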
Problem

Research questions and friction points this paper is trying to address.

Vision Transformers
layer normalization
hardware-aware adaptation
edge deployment
global reduction bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

hardware-aware adaptation
genetic programming
layer-specific scalar functions
post-training re-alignment
Vision Transformers
Kieran Carrigg
Department of Machine Learning and Neural Computing, Donders Institute for Brain, Cognition, and Behaviour, Thomas van Aquinostraat 4, 6525 GD Nijmegen, The Netherlands
Sigur de Vries
Department of Machine Learning and Neural Computing, Donders Institute for Brain, Cognition, and Behaviour, Thomas van Aquinostraat 4, 6525 GD Nijmegen, The Netherlands
Amirhossein Sadough
Department of Machine Learning and Neural Computing, Donders Institute for Brain, Cognition, and Behaviour, Thomas van Aquinostraat 4, 6525 GD Nijmegen, The Netherlands
Marcel van Gerven
Professor of Artificial Cognitive Systems, Donders Institute for Brain, Cognition and Behaviour
Artificial Intelligence, Machine Learning, Computational Neuroscience