Learning Advanced Self-Attention for Linear Transformers in the Singular Value Domain

📅 2025-05-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Standard Transformer self-attention exhibits limited frequency-domain expressivity, effectively implementing only a first-order polynomial low-pass filter. Method: This paper reformulates self-attention as a learnable, high-order directed graph filter in the singular value domain—achieved by mapping attention weights onto their singular value spectrum via SVD. The design preserves linear time complexity O(nd²) while enabling multi-band graph signal modeling, thereby overcoming the conventional low-pass constraint. Technically, it integrates graph signal processing and directed graph filtering theory, and introduces frequency-domain parameterization alongside linear attention mechanisms. Contribution/Results: The approach achieves state-of-the-art performance on Long Range Arena and multiple time-series classification benchmarks, demonstrating significant improvements in long-range dependency modeling and spectral awareness.
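The graph-filter view summarized above can be illustrated with a minimal NumPy sketch (variable names are illustrative, not from the paper): softmax attention produces a row-stochastic matrix, i.e., a normalized adjacency matrix, and applying it once to the value matrix is exactly a first-order graph filter, which typically smooths the token signal (the low-pass effect the paper criticizes).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4  # sequence length, head dimension

Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

# Softmax attention yields a row-stochastic (normalized adjacency) matrix.
scores = Q @ K.T / np.sqrt(d)
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

# Each row sums to 1, so A acts as a weighted averaging operator on tokens.
assert np.allclose(A.sum(axis=1), 1.0)

# First-order graph filtering: a single application of A to the value signal.
out = A @ V

# Convex averaging typically shrinks variation across tokens (low-pass effect).
print(V.var(axis=0).mean(), out.var(axis=0).mean())
```

Because `A` appears only to the first power, the filter's frequency response is fixed to this averaging shape; AGF's point is to replace it with a learnable, higher-order response in the singular value domain.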

📝 Abstract
Transformers have demonstrated remarkable performance across diverse domains. The key component of Transformers is self-attention, which learns the relationship between any two tokens in the input sequence. Recent studies have revealed that self-attention can be understood as a normalized adjacency matrix of a graph. Notably, from the perspective of graph signal processing (GSP), self-attention can be equivalently defined as a simple graph filter that applies GSP with the value vector as the signal. However, self-attention is a graph filter defined with only the first order of the polynomial matrix and acts as a low-pass filter, preventing the effective leverage of various frequency information. Consequently, existing self-attention mechanisms are designed in a rather simplified manner. Therefore, we propose a novel method, called **A**ttentive **G**raph **F**ilter (AGF), which interprets self-attention as learning a graph filter in the singular value domain from the perspective of graph signal processing for directed graphs, with linear complexity w.r.t. the input length $n$, i.e., $\mathcal{O}(nd^2)$. In our experiments, we demonstrate that AGF achieves state-of-the-art performance on various tasks, including the Long Range Arena benchmark and time series classification.
Problem

Research questions and friction points this paper is trying to address.

Self-attention acts as a first-order, low-pass graph filter
Existing mechanisms leave most frequency information unused
AGF is proposed as a more expressive, learnable graph filter
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learning graph filter in singular value domain
Linear complexity with respect to input length
Achieves state-of-the-art performance on benchmarks
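The $\mathcal{O}(nd^2)$ complexity claim rests on the standard linear-attention reordering of matrix products, which avoids ever forming the $n \times n$ attention matrix. A minimal sketch follows; the feature map `phi` (elu(x) + 1) is a common choice from the linear-attention literature, not necessarily AGF's actual parameterization, which additionally learns the filter in the singular value domain.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized attention in O(n * d^2) instead of O(n^2 * d).

    phi is a simple positive feature map (elu(x) + 1); this sketch only
    illustrates the complexity argument, not the paper's full method.
    """
    phi = lambda X: np.where(X > 0, X + 1.0, np.exp(X))  # elu(x) + 1
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                  # (d, d): keys/values summarized once
    Z = Qf @ Kf.sum(axis=0) + eps  # (n,): per-query normalizer
    return (Qf @ KV) / Z[:, None]  # (n, d): no n x n matrix is formed

rng = np.random.default_rng(0)
n, d = 1024, 16
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)  # (1024, 16)
```

The key design choice is associating the products as `Qf @ (Kf.T @ V)` rather than `(Qf @ Kf.T) @ V`: both are mathematically equivalent, but the former costs $\mathcal{O}(nd^2)$ while the latter costs $\mathcal{O}(n^2 d)$.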