AI Summary
This work investigates why MLP activations become sparse in Transformers under standard training, addressing a limitation of existing theories, which rely on strong assumptions. By connecting activation sparsity to the flatness of the loss landscape, the authors show that sparsity equals the ratio of an "augmented flatness" measure (a weighted sum of flatness measures) to the product of the MLP's input norm and activation gradient, which ties the phenomenon to naturally emerging flatness rather than to stronger assumptions. They also introduce derivative sparsity, which reduces to activation sparsity under ReLU, is more stable than activation sparsity, and enables pruning during backpropagation. Guided by this ratio, three plug-and-play modifications encourage sparser activations. On ImageNet-1K and C4, they yield relative improvements of at least 36% in inference sparsity and at least 50% in training sparsity over vanilla Transformers, indicating potential cost reductions in both inference and training.
Abstract
The observation that activation sparsity emerges in the MLP blocks of standardly trained Transformers offers an opportunity to drastically reduce computation costs without sacrificing performance. To explain this phenomenon theoretically, existing works have shown that activation sparsity results not from data properties or data fitting but from the implicit bias of the training process. However, these connections are obtained under strong assumptions that do not hold for deep models trained in the standard way for a large number of steps. Unlike these works, we find that the flatness of the loss landscape is also closely related to MLP activation sparsity and can serve as a weaker, naturally emerging assumption for standard deep networks. Specifically, 1) we find that the MLP activation sparsity equals a ratio between "augmented flatness" (a weighted sum of flatness measures) and the product of the input norm and the activation gradient of the MLP. We empirically find that this ratio decreases during training, leading to sparse activations. 2) We propose the notion of derivative sparsity, which reduces to activation sparsity under ReLU but additionally enables pruning in backward propagation and is more stable than activation sparsity. Building on these theoretical findings, we further encourage activation sparsity by decreasing the numerator and increasing the denominator of the ratio using three methods. These plug-and-play modifications effectively reduce the ratio and produce sparser activations. Experiments on ImageNet-1K and C4 demonstrate relative improvements of at least 36% in inference sparsity and at least 50% in training sparsity over vanilla Transformers, indicating further potential cost reduction in both inference and training.
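To make the statement that derivative sparsity reduces to activation sparsity under ReLU concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that both quantities are measured as the fraction of exactly-zero entries (of the activations and of the activation function's derivative, respectively) and that the ReLU derivative at zero is taken to be zero. The paper may define these quantities differently. The helper names (`zero_fraction`, `mlp_sparsity`) and the random toy weights are hypothetical.

```python
import numpy as np


def relu(x):
    """Elementwise ReLU."""
    return np.maximum(x, 0.0)


def relu_derivative(x):
    """Elementwise ReLU derivative; the value at exactly 0 is taken as 0 (assumed convention)."""
    return (x > 0).astype(x.dtype)


def zero_fraction(a):
    """Fraction of entries that are exactly zero (assumed sparsity measure)."""
    return float(np.mean(a == 0))


def mlp_sparsity(x, w_in, b_in):
    """Return (activation sparsity, derivative sparsity) for the first MLP layer."""
    pre = x @ w_in + b_in                      # pre-activations, shape (tokens, hidden)
    act_sparsity = zero_fraction(relu(pre))    # zeros in the forward activations
    deriv_sparsity = zero_fraction(relu_derivative(pre))  # zeros in the backward gate
    return act_sparsity, deriv_sparsity


# Toy example with random inputs and weights (illustrative only).
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))       # 8 tokens, model width 16
w_in = rng.standard_normal((16, 64))   # hidden width 64
b_in = np.zeros(64)

act_s, der_s = mlp_sparsity(x, w_in, b_in)
print(f"activation sparsity: {act_s:.3f}, derivative sparsity: {der_s:.3f}")
```

Under these assumptions the two numbers coincide for ReLU, because an activation is zero exactly when its pre-activation is non-positive, which is also when the derivative is zero. For activation functions whose derivative can vanish where the output does not (or the reverse), the two measures can differ.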