Graph Tokenization for Bridging Graphs and Transformers

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently converting graph-structured data into sequences compatible with general-purpose Transformer models. The authors propose a novel graph tokenization framework that, for the first time, integrates reversible graph serialization—guided by global substructure frequency statistics—with Byte Pair Encoding (BPE) to produce compact token representations that preserve structural semantics while remaining amenable to sequence-based architectures. Notably, this approach requires no modifications to standard Transformer backbones such as BERT, enabling direct application to graph data. Evaluated across 14 established graph benchmark datasets, the method achieves state-of-the-art results and frequently outperforms both conventional graph neural networks and specialized graph Transformers.

📝 Abstract
The success of large pretrained Transformers is closely tied to tokenizers, which convert raw input into discrete symbols. Extending these models to graph-structured data remains a significant challenge. In this work, we introduce a graph tokenization framework that generates sequential representations of graphs by combining reversible graph serialization, which preserves graph information, with Byte Pair Encoding (BPE), a widely adopted tokenizer in large language models (LLMs). To better capture structural information, the graph serialization process is guided by global statistics of graph substructures, ensuring that frequently occurring substructures appear more often in the sequence and can be merged by BPE into meaningful tokens. Empirical results demonstrate that the proposed tokenizer enables Transformers such as BERT to be directly applied to graph benchmarks without architectural modifications. The proposed approach achieves state-of-the-art results on 14 benchmark datasets and frequently outperforms both graph neural networks and specialized graph Transformers. This work bridges the gap between graph-structured data and the ecosystem of sequence models. Our code is available at https://github.com/BUPT-GAMMA/Graph-Tokenization-for-Bridging-Graphs-and-Transformers.
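To make the pipeline concrete, here is a minimal sketch of the BPE half of the idea: given a serialized graph (the serialization scheme itself is the paper's contribution and is not reproduced here), greedy BPE repeatedly merges the most frequent adjacent token pair, so substructures that recur in the sequence collapse into single tokens. The token sequence below is a hypothetical serialization of a small molecule-like graph, invented for illustration; it is not the paper's actual serialization format.

```python
from collections import Counter

def bpe_merge(sequence, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent pair."""
    seq = list(sequence)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # no pair repeats; further merges add no compression
            break
        merges.append((a, b))
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(a + b)  # fuse the pair into one token
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq, merges

# Hypothetical serialization of a small graph: node labels interleaved
# with edge markers, ordered so a frequent substructure ("C-") repeats.
tokens = ["C", "-", "C", "-", "C", "-", "C", "=", "O"]
compressed, merges = bpe_merge(tokens, 3)
# The repeated "C-" fragment is merged first, then pairs of "C-" tokens,
# shortening the sequence while keeping it reversible.
```

This mirrors the abstract's point that a frequency-guided serialization pays off: the more often a substructure appears contiguously in the sequence, the earlier BPE promotes it to a single vocabulary token.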
Problem

Research questions and friction points this paper is trying to address.

graph tokenization
Transformers
graph-structured data
sequence representation
pretrained models
Innovation

Methods, ideas, or system contributions that make the work stand out.

graph tokenization
reversible graph serialization
Byte Pair Encoding
graph substructure statistics
sequence modeling for graphs