🤖 AI Summary
Deep Transformer architectures suffer from over-smoothing, where token representations in deeper layers become increasingly homogeneous. To address this, we propose the Wavy Transformer, the first framework to model self-attention as a second-order wave dynamics process, replacing conventional first-order diffusion. This formulation preserves feature diversity during deep propagation. Methodologically, we reinterpret attention hidden states as evolving on a fully connected neural wave system, where position and velocity remain strictly coupled. Based on this physical analogy, we design novel attention layers, feed-forward networks, and normalization modules that intrinsically maintain this state-velocity coupling. The architecture introduces negligible parameter overhead and requires no additional hyperparameter tuning. Extensive experiments across diverse NLP and CV benchmarks demonstrate consistent and significant performance gains, validating both the effectiveness and generality of wave-based modeling for mitigating over-smoothing in deep Transformers.
📄 Abstract
Transformers have achieved remarkable success across natural language processing (NLP) and computer vision (CV). However, deep transformer models often suffer from an over-smoothing issue, in which token representations converge to similar values as they pass through successive transformer blocks. In this paper, we establish an equivalence between the hidden-state dynamics induced by stacked attention layers and graph neural diffusion on a complete graph. From this perspective, over-smoothing can be interpreted as a consequence of the dissipative nature of the underlying diffusion dynamics. Motivated by this physical interpretation, we propose Wavy Transformer, which consists of a novel attention layer based on second-order wavy dynamics. We also introduce a feed-forward network and a normalization layer designed to preserve the physical state-velocity relationship under the chain rule, thereby extending the transformer architecture. We further validate our proposed techniques on various transformer models for NLP and CV tasks. The results consistently demonstrate that Wavy Transformer improves performance with minimal additional parameters and no extra hyperparameter tuning.
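The contrast between dissipative first-order diffusion and energy-preserving second-order wave dynamics can be illustrated numerically. The following is a minimal toy sketch, not the paper's implementation: it assumes uniform attention (complete-graph averaging) and uses a simple semi-implicit Euler integrator for the wave system, showing that diffusion collapses token diversity while the coupled state-velocity update keeps it bounded.

```python
import numpy as np

# Toy setup (illustrative assumption, not the paper's architecture):
# n tokens with d-dimensional hidden states, and a uniform attention
# matrix A that averages over a complete graph.
rng = np.random.default_rng(0)
n, d = 8, 4
A = np.full((n, n), 1.0 / n)
x0 = rng.standard_normal((n, d))

def token_variance(x):
    """Spread of token representations around their mean (an over-smoothing metric)."""
    return float(np.mean((x - x.mean(axis=0)) ** 2))

dt, steps = 0.1, 200

# First-order diffusion: dx/dt = (A - I) x.
# All non-mean components decay, so tokens converge to a common value.
x = x0.copy()
for _ in range(steps):
    x = x + dt * (A - np.eye(n)) @ x
diff_var = token_variance(x)

# Second-order wave dynamics: d^2x/dt^2 = (A - I) x, integrated with a
# coupled position-velocity update. The dynamics oscillate rather than
# dissipate, so token diversity does not decay to zero.
x, v = x0.copy(), np.zeros_like(x0)
for _ in range(steps):
    v = v + dt * (A - np.eye(n)) @ x
    x = x + dt * v
wave_var = token_variance(x)

print(f"diffusion variance: {diff_var:.3e}, wave variance: {wave_var:.3e}")
```

Under diffusion the token variance decays geometrically toward zero (over-smoothing), whereas under the wave update it oscillates while remaining on the order of its initial value, mirroring the dissipative-versus-conservative distinction the abstract draws.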