🤖 AI Summary
This work addresses the difficulty of deploying Vision Transformers on edge devices, which stems largely from the computational overhead and global reduction bottleneck of Layer Normalization. The authors propose a hardware-aware method that avoids retraining from scratch: genetic programming evolves heterogeneous, layer-specific scalar functions to replace Layer Normalization, and a post-training re-alignment strategy recovers model performance. The modified architecture reaches a Top-1 accuracy of 84.25% on ImageNet-1K after only 20 epochs. The evolved scalar functions closely approximate the original normalization behavior, capturing 91.6% of the variance (R²) versus 70.2% for homogeneous baselines, while removing the global reduction and giving a favorable trade-off between arithmetic complexity and off-chip memory traffic.
📝 Abstract
Vision Transformers (ViTs) achieve state-of-the-art performance on challenging vision tasks, but their deployment on edge devices is severely hindered by the computational complexity and global reduction bottleneck imposed by layer normalization. Recent methods attempt to bypass this by replacing normalization layers with hardware-friendly scalar approximations. However, these homogeneous replacements do not fit every layer's behaviour equally well, and they rely on expensive model retraining. In this work, we propose a highly efficient, hardware-aware framework that uses genetic programming (GP) to evolve heterogeneous, layer-specific scalar functions directly from pre-trained weights. Coupled with a novel post-training re-alignment strategy, our approach entirely eliminates the need to retrain models from scratch. Our evolved expressions accurately approximate the target normalization behaviours, capturing $91.6\%$ of the variance ($R^2$) compared to only $70.2\%$ for homogeneous baselines, allowing our modified architecture to recover $84.25\%$ Top-1 ImageNet-1K accuracy in only 20 epochs. By preserving this performance while eliminating the global reduction bottleneck, our approach establishes a highly favourable trade-off between arithmetic complexity and off-chip memory traffic, removing a primary barrier to the efficient deployment of ViTs on edge accelerators.
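To make the "global reduction bottleneck" and the $R^2$ fit metric concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It contrasts a reference layer normalization, which must reduce over the whole feature vector before any output can be produced, with an element-wise scalar replacement, and scores the replacement with the coefficient of determination. The function `scalar_surrogate`, its tanh form, and its `alpha` and `gain` values are hypothetical stand-ins chosen for illustration; the abstract does not state the evolved expressions.

```python
import math
import random


def layer_norm(x, eps=1e-6):
    """Reference LayerNorm over one token vector (no affine parameters)."""
    n = len(x)
    mean = sum(x) / n                          # first reduction over all features
    var = sum((v - mean) ** 2 for v in x) / n  # second reduction over all features
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in x]


def scalar_surrogate(x, alpha=0.5, gain=2.0):
    """Hypothetical element-wise stand-in (assumed form, not from the paper).

    Each output depends only on its own input, so no reduction is needed.
    """
    return [gain * math.tanh(alpha * v) for v in x]


def r_squared(target, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(target) / len(target)
    ss_res = sum((t - p) ** 2 for t, p in zip(target, pred))
    ss_tot = sum((t - mean_t) ** 2 for t in target)
    return 1.0 - ss_res / ss_tot


random.seed(0)
x = [random.gauss(0.0, 1.5) for _ in range(768)]  # synthetic activations
target = layer_norm(x)
approx = scalar_surrogate(x)
print(f"R^2 of surrogate vs. LayerNorm: {r_squared(target, approx):.3f}")
```

In this sketch every output of `layer_norm` depends on all inputs through the mean and variance, whereas `scalar_surrogate` can be evaluated one element at a time. A per-layer search, such as the genetic programming described above, would in effect choose a different surrogate expression for each layer so as to maximise a fit score of this kind.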