🤖 AI Summary
To address Transformers’ lack of graph-structural priors and their difficulty in jointly modeling local topology and long-range dependencies, this paper proposes the first cross-architecture knowledge distillation paradigm tailored for structural knowledge transfer, carrying multi-scale structural inductive biases over from GNNs to Transformers. Methodologically, we introduce a micro-macro distillation loss and a multi-scale feature alignment mechanism that jointly align structure-aware representations at the node, subgraph, and whole-graph levels. Our contributions are threefold: (1) systematically bridging the architectural gap between GNNs and Transformers; (2) pioneering a structured distillation objective that explicitly encodes graph-structural priors; and (3) significantly enhancing Transformers’ structural awareness across multiple benchmark datasets, achieving synergistic improvements in both local topological pattern capture and long-range dependency modeling.
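The paper releases no code here, but the three-level alignment described above can be sketched numerically. The following is a minimal NumPy illustration, assuming mean-pooling at the subgraph and graph levels and a mean-squared-error term at each scale; the paper's actual loss functions, pooling operators, and weights may differ.

```python
import numpy as np

def multiscale_distill_loss(h_teacher, h_student, clusters,
                            w_node=1.0, w_sub=1.0, w_graph=1.0):
    """Hypothetical micro-macro distillation loss.

    Aligns teacher (GNN) and student (Transformer) node embeddings
    (shape: num_nodes x dim) at three scales. `clusters` is a list of
    index arrays, one per subgraph; the names and MSE/mean-pool choices
    are illustrative assumptions, not the paper's exact formulation.
    """
    # Micro: node-level alignment (per-node squared error).
    node_loss = np.mean((h_teacher - h_student) ** 2)

    # Meso: subgraph-level alignment (mean-pool each cluster, then MSE).
    sub_t = np.stack([h_teacher[idx].mean(axis=0) for idx in clusters])
    sub_s = np.stack([h_student[idx].mean(axis=0) for idx in clusters])
    sub_loss = np.mean((sub_t - sub_s) ** 2)

    # Macro: whole-graph alignment (global mean-pool, then MSE).
    graph_loss = np.mean((h_teacher.mean(axis=0)
                          - h_student.mean(axis=0)) ** 2)

    # Weighted sum combines the micro and macro terms into one objective.
    return w_node * node_loss + w_sub * sub_loss + w_graph * graph_loss
```

In practice the student's Transformer embeddings would be projected into the teacher's feature space before this loss is applied, and the scale weights tuned per dataset.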
📝 Abstract
Integrating the structural inductive biases of Graph Neural Networks (GNNs) with the global contextual modeling capabilities of Transformers is a pivotal challenge in graph representation learning. While GNNs excel at capturing localized topological patterns through message passing, their inherent limitations in long-range dependency modeling and in parallelizability hinder their deployment in large-scale scenarios. Conversely, Transformers leverage self-attention to achieve a global receptive field but struggle to inherit the intrinsic graph-structural priors of GNNs. This paper proposes a novel knowledge distillation framework that systematically transfers multi-scale structural knowledge from GNN teacher models to Transformer student models, offering a new perspective on the central challenges of cross-architecture distillation. The framework bridges the architectural gap between GNNs and Transformers through micro-macro distillation losses and multi-scale feature alignment. This work establishes a new paradigm for inheriting graph-structural biases in Transformer architectures, with broad application prospects.