🤖 AI Summary
Deep Transformer architectures suffer from over-smoothing, where token representations in deeper layers become increasingly homogeneous. To address this, we propose the Wavy Transformer, the first framework to model self-attention as a second-order wave dynamics process, replacing conventional first-order diffusion. This formulation preserves feature diversity during deep propagation. Methodologically, we reinterpret attention hidden states as evolving on a fully connected neural wave system, where position and velocity remain strictly coupled. Based on this physical analogy, we design novel attention layers, feed-forward networks, and normalization modules that intrinsically maintain this state-velocity coupling. The architecture introduces negligible parameter overhead and requires no additional hyperparameter tuning. Extensive experiments across diverse NLP and CV benchmarks demonstrate consistent and significant performance gains, validating both the effectiveness and generality of wave-based modeling for mitigating over-smoothing in deep Transformers.
📄 Abstract
Transformers have achieved remarkable success across natural language processing (NLP) and computer vision (CV). However, deep transformer models often suffer from an over-smoothing issue, in which token representations converge to similar values as they pass through successive transformer blocks. In this paper, we establish an equivalence between the hidden-state dynamics induced by stacked attention layers and graph neural diffusion on a complete graph. From this perspective, over-smoothing can be interpreted as a consequence of the dissipative nature of the underlying diffusion dynamics. Motivated by this physical interpretation, we propose Wavy Transformer, which consists of a novel attention layer based on second-order wavy dynamics. We also introduce a feed-forward network and a normalization layer designed to preserve the physical state-velocity relationship under the chain rule, thereby extending the transformer architecture. We further validate our proposed techniques on various transformer models for NLP and CV tasks. The results consistently demonstrate that Wavy Transformer improves performance with minimal additional parameters and no extra hyperparameter tuning.
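The contrast between dissipative first-order diffusion and energy-preserving second-order wave dynamics can be illustrated numerically. The following is a minimal toy sketch, not the paper's implementation: it assumes uniform attention (complete-graph averaging) and uses a simple semi-implicit Euler integrator for the wave system, showing that diffusion collapses token diversity while the coupled state-velocity update keeps it bounded.

```python
import numpy as np

# Toy setup (illustrative assumption, not the paper's architecture):
# n tokens with d-dimensional hidden states, and a uniform attention
# matrix A that averages over a complete graph.
rng = np.random.default_rng(0)
n, d = 8, 4
A = np.full((n, n), 1.0 / n)
x0 = rng.standard_normal((n, d))

def token_variance(x):
    """Spread of token representations around their mean (an over-smoothing metric)."""
    return float(np.mean((x - x.mean(axis=0)) ** 2))

dt, steps = 0.1, 200

# First-order diffusion: dx/dt = (A - I) x.
# All non-mean components decay, so tokens converge to a common value.
x = x0.copy()
for _ in range(steps):
    x = x + dt * (A - np.eye(n)) @ x
diff_var = token_variance(x)

# Second-order wave dynamics: d^2x/dt^2 = (A - I) x, integrated with a
# coupled position-velocity update. The dynamics oscillate rather than
# dissipate, so token diversity does not decay to zero.
x, v = x0.copy(), np.zeros_like(x0)
for _ in range(steps):
    v = v + dt * (A - np.eye(n)) @ x
    x = x + dt * v
wave_var = token_variance(x)

print(f"diffusion variance: {diff_var:.3e}, wave variance: {wave_var:.3e}")
```

Under diffusion the token variance decays geometrically toward zero (over-smoothing), whereas under the wave update it oscillates while remaining on the order of its initial value, mirroring the dissipative-versus-conservative distinction the abstract draws.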