k-Maximum Inner Product Attention for Graph Transformers and the Expressive Power of GraphGPS

📅 2026-04-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Graph Transformers suffer from quadratic complexity due to their fully connected attention mechanism, hindering scalability to large-scale graphs. This work proposes k-Maximum Inner Product (k-MIP) attention, which dynamically selects the top-k most relevant nodes for each query to construct sparse attention, and leverages a sign matrix for efficient computation of attention scores, achieving linear memory complexity and accelerated inference. Theoretically, k-MIP is shown to approximate full attention with arbitrary precision, and this paper establishes, for the first time, an upper bound on the graph discriminative power of the GraphGPS framework based on the S-SEG-WL test. Experiments demonstrate that the method efficiently handles graphs with over 500,000 nodes on a single A100 GPU, yielding nearly an order-of-magnitude speedup in inference while achieving state-of-the-art performance on large-scale benchmarks such as the Long Range Graph Benchmark.
📝 Abstract
Graph transformers have shown promise in overcoming limitations of traditional graph neural networks, such as oversquashing and difficulties in modelling long-range dependencies. However, their application to large-scale graphs is hindered by the quadratic memory and computational complexity of the all-to-all attention mechanism. Although alternatives such as linearized attention and restricted attention patterns have been proposed, these often degrade performance or limit expressive power. To better balance efficiency and effectiveness, we introduce k-Maximum Inner Product (k-MIP) attention for graph transformers. k-MIP attention selects the most relevant key nodes per query via a top-k operation, yielding a sparse yet flexible attention pattern. Combined with an attention score computation based on symbolic matrices, this results in linear memory complexity and practical speedups of up to an order of magnitude compared to all-to-all attention, enabling the processing of graphs with over 500k nodes on a single A100 GPU. We provide a theoretical analysis of expressive power, showing that k-MIP attention does not compromise the expressiveness of graph transformers: specifically, we prove that k-MIP transformers can approximate any full-attention transformer to arbitrary precision. In addition, we analyze the expressive power of the GraphGPS framework, in which we integrate our attention mechanism, and establish an upper bound on its graph distinguishing capability in terms of the S-SEG-WL test. Finally, we validate our approach on the Long Range Graph Benchmark, the City-Networks benchmark, and two custom large-scale inductive point cloud datasets, consistently ranking among the top-performing scalable graph transformers.
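The core mechanism described above — each query attending only to its top-k keys by inner product — can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy version for clarity: it materializes the full score matrix (which the paper's sign-matrix-based computation explicitly avoids) and uses a plain per-row softmax over the selected keys; function and variable names are illustrative, not from the paper.

```python
import numpy as np

def kmip_attention(Q, K, V, k):
    """Toy k-MIP attention: each query attends to its k keys with the
    largest inner products. Illustrative only -- the paper computes
    scores via symbolic (sign) matrices without forming an (n, n) matrix."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)  # (n, n); quadratic here, linear in the paper
    # unordered indices of the k largest scores in each row
    topk = np.argpartition(scores, -k, axis=1)[:, -k:]
    out = np.zeros_like(V)
    for i in range(n):
        s = scores[i, topk[i]]
        w = np.exp(s - s.max())
        w /= w.sum()               # softmax over only the k selected keys
        out[i] = w @ V[topk[i]]    # sparse weighted sum of value vectors
    return out
```

With k equal to the number of nodes, this reduces exactly to full softmax attention, which is the intuition behind the paper's claim that k-MIP attention can approximate full attention to arbitrary precision.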
Problem

Research questions and friction points this paper is trying to address.

graph transformers
attention mechanism
computational complexity
expressive power
large-scale graphs
Innovation

Methods, ideas, or system contributions that make the work stand out.

k-Maximum Inner Product Attention
Graph Transformers
Sparse Attention
Linear Complexity
Expressive Power