🤖 AI Summary
To address the prohibitively high memory and computational complexity—O(N²)—of self-attention in Graph Transformers, this paper proposes Random Batch Attention (RBA). RBA introduces the random batch method from computational mathematics into attention computation, reformulating attention weight estimation via particle-system modeling and cross-dimensional parallelism, with theoretical convergence guarantees. By decoupling quadratic dependencies, RBA reduces both time and space complexity to linear—O(N)—while preserving the expressive power of standard self-attention. Extensive experiments on large-scale graph benchmarks demonstrate that RBA significantly reduces memory footprint and accelerates training, without compromising model performance. Moreover, RBA serves as a drop-in replacement for conventional attention modules across diverse Graph Transformer architectures, exhibiting strong generalizability and practical deployability in real-world applications.
📝 Abstract
The attention mechanism is a core component of Transformer models. It extracts features from embedded vectors by incorporating global information, and its expressive power has been shown to be strong. Nevertheless, its quadratic complexity limits its practicality. Although several works have proposed sparse attention mechanisms, they lack theoretical analysis of how much expressivity their mechanisms retain while reducing complexity. In this paper, we propose Random Batch Attention (RBA), a linear self-attention mechanism with theoretical support for maintaining expressivity. Random Batch Attention has several notable strengths: (1) it has linear time complexity and, beyond that, can be parallelized along a new dimension, which yields substantial memory savings; (2) it can improve most existing models by replacing their attention mechanisms, including many previously improved attention variants; (3) it has a theoretical convergence guarantee, as it derives from Random Batch Methods in computational mathematics. Experiments on large graphs confirm the advantages above. Moreover, our theoretical modeling of the self-attention mechanism provides a new tool for future research on attention-mechanism analysis.
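To make the core idea concrete, the following is a minimal sketch of a random-batch attention scheme in NumPy: tokens are randomly shuffled and split into small batches, and ordinary softmax attention is computed only within each batch, so cost scales as O(N · B) rather than O(N²). The function name, batching scheme, and all details here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def random_batch_attention(Q, K, V, batch_size=64, rng=None):
    """Illustrative random-batch attention (not the authors' implementation).

    Shuffle the N tokens, split them into batches of size `batch_size`,
    and run standard softmax attention only within each batch. Each batch
    costs O(B^2), and there are N/B batches, so the total cost is
    O(N * B) -- linear in N for fixed batch size.
    """
    rng = np.random.default_rng() if rng is None else rng
    N, d = Q.shape
    perm = rng.permutation(N)          # random partition of the tokens
    out = np.empty_like(V)
    for start in range(0, N, batch_size):
        idx = perm[start:start + batch_size]
        # (B, B) score matrix instead of the full (N, N) one
        scores = Q[idx] @ K[idx].T / np.sqrt(d)
        scores -= scores.max(axis=1, keepdims=True)   # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)
        out[idx] = weights @ V[idx]
    return out
```

Note that when `batch_size >= N` there is a single batch containing all tokens, and the sketch reduces exactly to full softmax attention; smaller batches trade exactness for linear cost, which is the trade-off the paper's convergence analysis addresses.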