🤖 AI Summary
Standard Transformer self-attention has limited frequency-domain expressivity: it effectively implements only a first-order polynomial filter that behaves as a low-pass filter. Method: This paper reformulates self-attention as a learnable, high-order graph filter defined in the singular value domain, i.e., attention is interpreted through the singular value spectrum (SVD) of its weight matrix, which is the appropriate frequency domain for directed graphs. The design preserves linear time complexity $\mathcal{O}(nd^2)$ in the input length $n$ while enabling multi-band graph signal modeling, thereby overcoming the conventional low-pass constraint. Technically, it draws on graph signal processing and directed graph filtering theory, combining frequency-domain parameterization with linear attention mechanisms. Contribution/Results: The approach achieves state-of-the-art performance on the Long Range Arena benchmark and multiple time-series classification benchmarks, demonstrating improved long-range dependency modeling and spectral awareness.
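To make the "singular value domain" idea concrete, here is a minimal NumPy sketch of a high-order polynomial filter applied through the SVD of a row-normalized directed adjacency matrix. This is only an illustration of the general technique the summary describes, not AGF's actual parameterization: the coefficients `theta` are placeholders for learnable parameters, and an explicit SVD costs far more than the paper's $\mathcal{O}(nd^2)$, which AGF attains by other means.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 4                          # toy sequence length and feature dim
A = rng.random((n, n))
A /= A.sum(axis=-1, keepdims=True)   # row-normalized directed adjacency
X = rng.standard_normal((n, d))      # graph signal (the value vectors)

# SVD handles directed (asymmetric) graphs, where an eigendecomposition
# of A may not exist; singular values play the role of frequencies.
U, s, Vt = np.linalg.svd(A)

# High-order polynomial filter on the singular values:
# g(s) = theta_0 + theta_1*s + theta_2*s^2  (learnable in a real model).
theta = np.array([0.1, 1.0, -0.3])   # hypothetical placeholder coefficients
g_s = sum(t * s**k for k, t in enumerate(theta))

# Filtered signal U g(S) V^T X: unlike plain attention (g(s) = s),
# the filter can amplify or suppress any frequency band independently.
out = U @ np.diag(g_s) @ Vt @ X
print(out.shape)  # (6, 4)
```

Setting `theta = [0, 1, 0]` recovers plain first-order filtering `A @ X`; higher-order terms are what lift the low-pass restriction.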
📝 Abstract
Transformers have demonstrated remarkable performance across diverse domains. The key component of Transformers is self-attention, which learns the relationship between any two tokens in the input sequence. Recent studies have revealed that self-attention can be understood as a normalized adjacency matrix of a graph. Notably, from the perspective of graph signal processing (GSP), self-attention can be equivalently defined as a simple graph filter that applies GSP using the value vectors as the signal. However, self-attention is a graph filter defined with only the first-order term of a matrix polynomial, and it acts as a low-pass filter, preventing the effective use of information across frequency bands. Consequently, existing self-attention mechanisms are designed in a rather simplified manner. Therefore, we propose a novel method, called **A**ttentive **G**raph **F**ilter (AGF), which interprets self-attention as learning a graph filter in the singular value domain, from the perspective of graph signal processing for directed graphs, with linear complexity w.r.t. the input length $n$, i.e., $\mathcal{O}(nd^2)$. In our experiments, we demonstrate that AGF achieves state-of-the-art performance on various tasks, including the Long Range Arena benchmark and time series classification.
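The abstract's claim that standard self-attention is a first-order graph filter can be checked in a few lines. The sketch below, assuming the usual scaled dot-product formulation, builds the row-stochastic attention matrix $A$ (a normalized adjacency of a directed token graph) and shows that the attention output is exactly $h(A)\,V$ for the degree-1 polynomial $h(A) = A$; toy shapes and random inputs are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                                # toy sequence length, head dim
Q, K, V = rng.standard_normal((3, n, d))   # query, key, value matrices

# Scaled dot-product attention weights (row-wise softmax).
scores = Q @ K.T / np.sqrt(d)
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)         # rows sum to 1: normalized adjacency

# Standard self-attention output.
out_attn = A @ V

# GSP reading: out = h(A) V with the degree-1 polynomial h(A) = 0*I + 1*A.
h_A = A
print(np.allclose(out_attn, h_A @ V))  # True: attention is a first-order filter
```

Because $A$ is row-stochastic, repeated application averages the signal toward smooth (low-frequency) components, which is the low-pass behavior the abstract refers to.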