🤖 AI Summary
Large language models (LLMs) struggle to model graph-structured data effectively: they lack a built-in inductive bias for graphs and cannot natively capture structural dependencies.
Method: This paper introduces GDL4LLM (Graph-Defined Language for Large Language Model), a paradigm that encodes graph structure as learnable, language-like sequences via structure-aware tokenization and neighborhood-topology-driven graph-to-sequence mapping, yielding compact yet expressive representations of high-order graph structure. LLMs are then pre-trained and fine-tuned directly on these graph-language sequences, letting them model graph topology end-to-end without relying on verbose textual descriptions or node/edge attribute embeddings.
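The paper does not spell out the graph-to-sequence mapping in this summary, but a common way to realize "neighborhood-topology-driven" sequence generation is to sample random walks and emit each walk as a sentence of dedicated node tokens. The sketch below is a minimal illustration under that assumption; the token format `<n{id}>`, walk length, and walks-per-node are all hypothetical choices, not the authors' settings.

```python
import random

def random_walks(adj, walk_len=8, walks_per_node=2, seed=0):
    """Sample fixed-length random walks from every node of an adjacency-list
    graph. Each walk becomes one 'sentence' of the graph-language corpus,
    with node ids rendered as dedicated tokens like '<n42>' (hypothetical
    token format)."""
    rng = random.Random(seed)
    corpus = []
    for start in adj:
        for _ in range(walks_per_node):
            walk, node = [start], start
            for _ in range(walk_len - 1):
                nbrs = adj[node]
                if not nbrs:          # dead end: stop this walk early
                    break
                node = rng.choice(nbrs)
                walk.append(node)
            corpus.append(" ".join(f"<n{v}>" for v in walk))
    return corpus

# Tiny undirected example graph as an adjacency list
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
corpus = random_walks(adj)
```

A corpus built this way stays concise: a handful of short walks summarizes a node's high-order neighborhood in a few dozen tokens, versus the paragraphs a textual graph description would need.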
Contribution/Results: Evaluated on three real-world graph datasets, GDL4LLM consistently outperforms description-based and attribute-embedding baselines, improving node classification accuracy by 4.2% on average. The authors present it as the first approach that lets LLMs model graph structure of arbitrary order both efficiently and end-to-end.
📝 Abstract
Recent efforts leverage Large Language Models (LLMs) for modeling text-attributed graph structures in node classification tasks. These approaches either describe graph structures in text for LLMs to understand, or aggregate LLM-generated textual attribute embeddings along the graph structure. However, they face two main limitations in modeling graph structures with LLMs: (i) graph descriptions become verbose when describing high-order graph structure, and (ii) textual attributes alone do not carry adequate graph structure information. It is therefore challenging to model graph structure both concisely and adequately with LLMs, which lack built-in mechanisms for modeling graph structure directly and struggle with complex long-range dependencies between high-order nodes and target nodes. Inspired by the observation that LLMs pre-trained on one language can achieve exceptional performance on another with minimal additional training, we propose Graph-Defined Language for Large Language Model (GDL4LLM). This novel framework enables LLMs to transfer their powerful language understanding capabilities to graph-structured data. GDL4LLM translates graphs into a graph language corpus instead of graph descriptions and pre-trains LLMs on this corpus to adequately understand graph structures. During fine-tuning, this corpus describes the structural information of target nodes concisely with only a few tokens. By treating graphs as a new language, GDL4LLM enables LLMs to model graph structures adequately and concisely for node classification tasks. Extensive experiments on three real-world datasets demonstrate that GDL4LLM outperforms description-based and textual-attribute-embedding-based baselines by efficiently modeling different orders of graph structure with LLMs.
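Treating graphs as a new language implies giving the LLM new tokens: one plausible reading is that each node gets a fresh entry appended to the model's existing vocabulary, with the corresponding embedding rows learned during graph-language pretraining. The sketch below illustrates that idea with a plain dictionary; the base vocabulary size of 32000 and the `<n{id}>` token format are assumptions for illustration, not details from the paper.

```python
def build_graph_vocab(num_nodes, base_vocab_size=32000):
    """Assign each graph node a new token id placed after the LLM's existing
    vocabulary (base size is a hypothetical placeholder). In practice the
    model's embedding matrix would be extended by num_nodes rows, trained
    during graph-language pretraining."""
    return {f"<n{i}>": base_vocab_size + i for i in range(num_nodes)}

def encode_walk(walk_sentence, vocab):
    """Map a whitespace-separated graph-language sentence to token ids."""
    return [vocab[tok] for tok in walk_sentence.split()]

vocab = build_graph_vocab(4)
ids = encode_walk("<n0> <n2> <n3>", vocab)
```

Because each node is a single token, a target node's structural context fits in a few tokens of prompt during fine-tuning, in contrast to the verbose natural-language graph descriptions the abstract criticizes.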