Wavy Transformer

📅 2025-08-18
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Deep Transformer architectures suffer from over-smoothing, in which token representations in deeper layers become increasingly homogeneous. To address this, we propose the Wavy Transformer, the first framework to model self-attention as a second-order wave dynamics process, replacing conventional first-order diffusion. This formulation preserves feature diversity during deep propagation. Methodologically, we reinterpret attention hidden states as evolving on a fully connected neural wave system, where position and velocity remain strictly coupled. Based on this physical analogy, we design novel attention layers, feed-forward networks, and normalization modules that intrinsically maintain this state-velocity coupling. The architecture introduces negligible parameter overhead and requires no additional hyperparameter tuning. Extensive experiments across diverse NLP and CV benchmarks demonstrate consistent and significant performance gains, validating both the effectiveness and generality of wave-based modeling for mitigating over-smoothing in deep Transformers.

๐Ÿ“ Abstract
Transformers have achieved remarkable success across natural language processing (NLP) and computer vision (CV). However, deep transformer models often suffer from an over-smoothing issue, in which token representations converge to similar values as they pass through successive transformer blocks. In this paper, we establish an equivalence between the hidden-state dynamics induced by stacked attention layers and graph neural diffusion on a complete graph. From this perspective, over-smoothing can be interpreted as a consequence of the dissipative nature of the underlying diffusion dynamics. Motivated by this physical interpretation, we propose Wavy Transformer, which consists of a novel attention layer based on second-order wavy dynamics. We also introduce a feed-forward network and a normalization layer designed to preserve the physical state-velocity relationship under the chain rule, thereby extending the transformer architecture. We further validate our proposed techniques on various transformer models for NLP and CV tasks. The results consistently demonstrate that Wavy Transformer improves performance with minimal additional parameters and no extra hyperparameter tuning.
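The abstract's diffusion-versus-wave distinction can be illustrated numerically. The toy below is an illustrative sketch, not the paper's architecture: a fixed symmetrized attention-like matrix stands in for stacked attention layers, and the integration scheme and step size are assumptions. It evolves the same token features under first-order diffusion and under second-order wave dynamics on the complete token graph, and measures the cross-token feature variance that over-smoothing destroys.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, dt, steps = 8, 4, 0.1, 500  # tokens, feature dim, step size, depth

# Toy "attention" weights: softmax of random scores, symmetrized so the
# wave system is a genuine oscillation (a simplifying assumption; real
# attention matrices are neither symmetric nor fixed across layers).
scores = rng.normal(size=(n, n))
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
W = 0.5 * (A + A.T)
L = np.diag(W.sum(axis=1)) - W  # graph Laplacian of the complete token graph

def token_variance(x):
    """Total feature variance across tokens -- what over-smoothing destroys."""
    return float(((x - x.mean(axis=0)) ** 2).sum())

x0 = rng.normal(size=(n, d))

# First-order diffusion (standard stacked-attention analogy): x' = -L x.
# Every nonconstant mode decays, so tokens converge to their mean.
x = x0.copy()
for _ in range(steps):
    x = x - dt * (L @ x)
diffusion_var = token_variance(x)

# Second-order wave dynamics (Wavy Transformer analogy): x'' = -L x,
# integrated with semi-implicit (symplectic) Euler so the coupled
# state-velocity pair (x, v) approximately conserves energy: modes
# oscillate instead of dissipating, preserving token diversity.
x, v = x0.copy(), np.zeros_like(x0)
wave_vars = []
for _ in range(steps):
    v = v - dt * (L @ x)
    x = x + dt * v
    wave_vars.append(token_variance(x))
wave_var = float(np.mean(wave_vars[-50:]))  # average over an oscillation window

print(f"initial {token_variance(x0):.3f}  "
      f"diffusion {diffusion_var:.2e}  wave {wave_var:.3f}")
```

Under diffusion the cross-token variance collapses toward zero with depth, while under the wave dynamics it oscillates around a nonzero level, matching the abstract's claim that dissipation, not attention itself, drives over-smoothing.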
Problem

Research questions and friction points this paper is trying to address.

Over-smoothing: token representations homogenize in deep transformer models
Stacked attention acts as dissipative first-order diffusion, driving representations toward uniformity
How to restore feature diversity without extra parameters or hyperparameter tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Second-order wavy dynamics attention layer
State-velocity preserving feed-forward network
Normalization layer for chain rule compatibility
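The contributions above center on keeping state and velocity coupled through every module. A hypothetical forward pass conveying that idea is sketched below; the function name, `dt` step, and update rule are assumptions for illustration, and the paper's actual attention, feed-forward, and normalization layers differ.

```python
import numpy as np

def softmax(z):
    """Numerically stable row-wise softmax."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def wavy_block(x, v, Wq, Wk, Wv, dt=0.1):
    """One hypothetical wavy-attention block (illustrative, not the paper's
    layer): the attention output acts as a force on the velocity v
    (second-order update), and the state x then follows the velocity,
    instead of attention updating x directly (first-order residual)."""
    attn = softmax((x @ Wq) @ (x @ Wk).T / np.sqrt(x.shape[-1])) @ (x @ Wv)
    v = v + dt * (attn - x)  # acceleration from the attention "force"
    x = x + dt * v           # state follows the updated velocity
    return x, v
```

Stacking such blocks propagates the pair (x, v) jointly, which is the state-velocity coupling the feed-forward and normalization designs are meant to preserve.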
Satoshi Noguchi
Research Institute for Value-Added Information Generation, Japan Agency for Marine-Earth Science and Technology, RIKEN Center for Advanced Intelligence Project
Yoshinobu Kawahara
The University of Osaka & RIKEN Center for Advanced Intelligence Project
Machine Learning · Dynamical Systems · Nonlinear Dynamics