🤖 AI Summary
This work investigates whether modern vision backbones actually require pointwise activation functions (e.g., ReLU, GELU) or the exponential softmax to introduce nonlinearity. The authors propose polynomial alternatives that use no explicit activation functions, relying instead on Hadamard products to build higher-order polynomials of the input. The same formulation replaces the nonlinear components in MLPs, convolutions, and attention, and these modules are instantiated within the MetaFormer framework to yield PolyNeXt, an activation-free family of vision models. Experiments show that PolyNeXt matches or exceeds activation-based counterparts across model scales on ImageNet classification, ADE20K semantic segmentation, and out-of-distribution robustness benchmarks, while also outperforming prior polynomial networks at lower computational cost.
📝 Abstract
Modern vision backbones treat pointwise activations (e.g., ReLU, GELU) and exponential softmax as essential sources of nonlinearity, but we demonstrate they are not required within MetaFormer-style vision backbones. We design activation-free polynomial alternatives for three core primitives (MLPs, convolutions, and attention), where Hadamard products replace standard nonlinearities to yield polynomial functions of the input. These modules integrate seamlessly into existing architectures: instantiated within MetaFormer, a modular framework for vision backbones, our PolyNeXt models match or exceed activation-based counterparts across model scales on ImageNet classification, ADE20K semantic segmentation, and out-of-distribution robustness. We also substantially outperform prior polynomial networks at reduced computational cost, showing that polynomial variants of standard modules beat complex custom architectures.
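To make the central idea concrete, the following is a minimal sketch of how a Hadamard product can stand in for a pointwise activation inside an MLP block. It is not the PolyNeXt implementation: the two-branch form, the absence of biases, and the names `poly_mlp`, `W_a`, `W_b`, and `W_out` are all assumptions chosen for illustration. The block computes `W_out((W_a x) ⊙ (W_b x))`, which is a degree-2 polynomial of the input.

```python
import random


def matvec(W, x):
    """Multiply a matrix W (a list of rows) by a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def random_matrix(rows, cols, rng):
    """Hypothetical helper: a rows x cols matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def poly_mlp(x, W_a, W_b, W_out):
    """Assumed activation-free block: W_out @ ((W_a @ x) * (W_b @ x)).

    The elementwise (Hadamard) product of two linear projections is the
    only source of nonlinearity; no ReLU, GELU, or softmax is applied.
    """
    a = matvec(W_a, x)
    b = matvec(W_b, x)
    hidden = [u * v for u, v in zip(a, b)]  # Hadamard product
    return matvec(W_out, hidden)


rng = random.Random(0)
dim, hidden_dim = 4, 8  # illustrative sizes, not taken from the paper
W_a = random_matrix(hidden_dim, dim, rng)
W_b = random_matrix(hidden_dim, dim, rng)
W_out = random_matrix(dim, hidden_dim, rng)

x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
y = poly_mlp(x, W_a, W_b, W_out)

# Without biases the block is homogeneous of degree 2: doubling the
# input multiplies every output by 4, which a linear map would not do.
y_scaled = poly_mlp([2.0 * v for v in x], W_a, W_b, W_out)
```

Stacking such blocks raises the polynomial degree multiplicatively, which is one way higher-order polynomials of the input could arise from repeated Hadamard products; how PolyNeXt arranges its branches, normalization, and biases is not specified in the abstract.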