🤖 AI Summary
Conventional Transformer feed-forward networks (FFNs) use a single fixed activation function, so they apply the same nonlinear transformation to every token. To address this, the authors propose Mixture of Activations (MoA), which mixes a dictionary of activation functions through lightweight, input-dependent gates while sharing the same linear projections. They also introduce learnable activations (LA), an input-independent counterpart that forms linear combinations of activation functions. Theoretically, MoA strictly contains LA, which in turn strictly contains fixed-activation FFNs, giving strict expressive separations at finite width. Both MoA and LA apply to ReLU-type and SwiGLU-type FFNs. In pre-training experiments on dense and mixture-of-experts language models ranging from 0.12B to 2B parameters, MoA consistently achieves lower final loss and more favorable scaling behavior than well-tuned baselines, with minimal parameter and computational overhead.
📝 Abstract
Feedforward network (FFN) layers account for a large fraction of parameters and nonlinear expressivity in Transformer-based large language models (LLMs). Despite the evolution from ReLU and GELU to gated variants such as SwiGLU, most FFN designs still use a single fixed activation function, applying the same nonlinear transformation to all tokens. In this work, we propose Mixture of Activations (MoA), a token-adaptive FFN design that mixes a dictionary of activation functions using lightweight input-dependent gates while sharing the same linear projections. As an input-independent counterpart, we also introduce learnable activations (LA), which form linear combinations of activation functions for both ReLU-type and SwiGLU-type FFNs. Theoretically, we establish strict finite-width expressive separations among fixed-activation FFNs, LA, and MoA: LA strictly contains fixed-activation FFNs, while MoA strictly contains LA, with the additional expressivity arising from input-dependent nonlinear hybridization. Empirically, we evaluate MoA through extensive pre-training experiments on dense and MoE language models ranging from 0.12B to 2B parameters under different token budgets, optimizers, and learning rate schedules. MoA consistently achieves lower terminal loss and exhibits more favorable scaling behavior than well-tuned baselines, with minimal parameter and computational overhead. These results suggest that token-adaptive activation mixing is a simple and effective mechanism for improving FFN expressivity in LLMs.
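To make the nesting of fixed-activation FFNs, LA, and MoA concrete, here is a minimal single-token sketch of a ReLU-type FFN in plain Python. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not specify the activation dictionary, the gate parameterization, or the gate normalization, so the choices here are hypothetical: a dictionary of ReLU, GELU, and SiLU, a linear gate `W_gate`, a softmax over gate logits, and one gate weight per activation shared across all hidden units.

```python
import math

# Hypothetical activation dictionary; the abstract does not list its contents.
def relu(v):
    return max(0.0, v)

def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def silu(v):
    return v / (1.0 + math.exp(-v))

ACTIVATIONS = [relu, gelu, silu]

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def mix(h, weights):
    """Apply a weighted sum of all dictionary activations to each hidden unit."""
    return [sum(w * act(v) for w, act in zip(weights, ACTIVATIONS)) for v in h]

def fixed_ffn(x, W_in, W_out, act=relu):
    """Standard FFN: one fixed activation for every token."""
    return matvec(W_out, [act(v) for v in matvec(W_in, x)])

def la_ffn(x, W_in, W_out, coeffs):
    """LA: learned mixing coefficients that do not depend on the input."""
    return matvec(W_out, mix(matvec(W_in, x), coeffs))

def moa_ffn(x, W_in, W_out, W_gate):
    """MoA: mixing weights computed from the token itself (assumed softmax gate).
    The linear projections W_in and W_out are shared by all activations."""
    gates = softmax(matvec(W_gate, x))
    return matvec(W_out, mix(matvec(W_in, x), gates))

# Toy weights: input dim 2, hidden dim 3, output dim 2.
W_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_out = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]]
W_gate = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]  # one row per activation
x = [1.0, -2.0]

y_fixed = fixed_ffn(x, W_in, W_out)
y_la = la_ffn(x, W_in, W_out, [1.0, 0.0, 0.0])  # LA collapsed onto ReLU
y_moa = moa_ffn(x, W_in, W_out, W_gate)         # token-dependent mixture
```

In this sketch, setting the LA coefficients to a one-hot vector recovers the fixed-activation FFN, and zeroing `W_gate` makes the MoA gates constant, which recovers an LA with uniform coefficients. This only illustrates the containment direction; the strictness of the separations is the paper's theoretical result and is not demonstrated here. The gate adds one small matrix of shape (number of activations, input dimension), consistent with the abstract's description of the gates as lightweight.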