🤖 AI Summary
No single activation function works best across both balanced and imbalanced classification; notably, Sigmoid is biased towards frequent classes and degrades in long-tailed settings. To address this, we propose the Adaptive Parametric Activation (APA), motivated by a statistical analysis showing that activation behavior is coupled to the data distribution. APA is a learnable, task-adaptive formula that unifies most common activation functions and is compatible with CNNs, Transformers, multimodal models, and large language models. Aligning the activation with the data distribution allows APA to be used across layers (intermediate and attention layers), tasks, and architectures. Evaluated on five long-tailed benchmarks (ImageNet-LT, iNaturalist2018, Places-LT, CIFAR100-LT, and LVIS), APA outperforms state-of-the-art methods. It also improves performance on other tasks: classification, object detection, visual instruction following, image generation, and next-text-token prediction.
📝 Abstract
The activation function plays a crucial role in model optimisation, yet the optimal choice remains unclear. For example, the Sigmoid activation is the de facto activation in balanced classification tasks; in imbalanced classification, however, it proves inappropriate due to its bias towards frequent classes. In this work, we delve deeper into this phenomenon by performing a comprehensive statistical analysis of the classification and intermediate layers of both balanced and imbalanced networks, and we empirically show that aligning the activation function with the data distribution enhances performance in both balanced and imbalanced tasks. To this end, we propose the Adaptive Parametric Activation (APA) function, a novel and versatile activation function that unifies most common activation functions under a single formula. APA can be applied in both intermediate layers and attention layers, significantly outperforming the state-of-the-art on several imbalanced benchmarks such as ImageNet-LT, iNaturalist2018, Places-LT, CIFAR100-LT and LVIS. We also extend APA to a range of other tasks, including classification, detection, visual instruction following, image generation and next-text-token prediction. APA improves performance on multiple benchmarks across various model architectures. The code is available at https://github.com/kostas1515/AGLU.
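To make the idea of unifying activations "under a single formula" concrete, here is a minimal sketch. The abstract does not state the formula, so the form used below, `(lam * exp(-kappa * z) + 1) ** (-1 / lam)`, is an assumption chosen for illustration; the names `apa`, `kappa`, and `lam` are hypothetical, and the linked repository should be treated as the authoritative implementation. What can be verified mathematically is that this form equals Sigmoid at `kappa = lam = 1` and approaches the Gumbel CDF `exp(-exp(-z))` as `lam` tends to 0.

```python
import math


def apa(z: float, kappa: float = 1.0, lam: float = 1.0) -> float:
    """Illustrative parametric activation (assumed form, not from the abstract).

    Computes (lam * exp(-kappa * z) + 1) ** (-1 / lam) for lam > 0.
    In a real model kappa and lam would be learnable parameters.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    return (lam * math.exp(-kappa * z) + 1.0) ** (-1.0 / lam)


def sigmoid(z: float) -> float:
    """Standard logistic function, recovered by apa at kappa = lam = 1."""
    return 1.0 / (1.0 + math.exp(-z))


def gumbel_cdf(z: float) -> float:
    """Gumbel CDF, the limit of apa as lam -> 0 with kappa = 1."""
    return math.exp(-math.exp(-z))


if __name__ == "__main__":
    for z in (-2.0, 0.0, 2.0):
        # Compare the parametric form against its two special cases.
        print(z, apa(z, 1.0, 1.0), sigmoid(z), apa(z, 1.0, 1e-6), gumbel_cdf(z))
```

Under these assumptions, the two parameters move the curve smoothly between a symmetric (Sigmoid-like) and an asymmetric (Gumbel-like) shape, which is one way an activation could be aligned with a balanced or long-tailed data distribution.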