🤖 AI Summary
It remains unclear whether current graph-tokenizing large language models genuinely comprehend graph tokens embedded in the natural-language space. This work proposes GTEval, the first systematic evaluation framework for graph-token understanding, which combines a unified formalization of these models, instruction transformation strategies, attention analysis, and instruction tuning, with experiments conducted across six representative models. The study finds that existing models are generally either over-sensitive or over-insensitive to instruction variations and rely heavily on text for reasoning, struggling to make effective use of graph tokens. Although instruction tuning yields modest improvements, significant bottlenecks persist in graph-token comprehension. This research establishes a new evaluation benchmark and analytical perspective for integrating graphs with language models.
📝 Abstract
The remarkable success of large language models (LLMs) has motivated researchers to adapt them as universal predictors for various graph tasks. As a widely recognized paradigm, Graph-Tokenizing LLMs (GTokenLLMs) compress complex graph data into graph tokens and treat them as prefix tokens for querying LLMs, leading many to believe that LLMs can understand graphs more effectively and efficiently. In this paper, we challenge this belief: Do GTokenLLMs fully understand graph tokens in the natural-language embedding space? Motivated by this question, we formalize a unified framework for GTokenLLMs and propose an evaluation pipeline, GTEval, to assess graph-token understanding via instruction transformations at the format and content levels. We conduct extensive experiments on 6 representative GTokenLLMs with GTEval. The primary findings are as follows: (1) Existing GTokenLLMs do not fully understand graph tokens. They exhibit over-sensitivity or over-insensitivity to instruction changes, and rely heavily on text for reasoning; (2) Although graph tokens preserve task-relevant graph information and receive attention across LLM layers, their utilization varies across models and instruction variants; (3) Additional instruction tuning can improve performance on the original and seen instructions, but it does not fully address the challenge of graph-token understanding, calling for further improvement.