🤖 AI Summary
Large language models (LLMs) exhibit “graph hallucination” (outputs inconsistent with ground-truth graph structures), yet systematic evaluation of their capabilities on graph-related tasks remains lacking. Method: We propose a zero-shot, prompt-based evaluation framework to assess LLMs on graph *recitation* (e.g., the Karate Club graph) and graph *generation* (e.g., Erdős–Rényi random graphs), with a dedicated metric for each task that quantifies errors as hallucinations. Contribution/Results: Our work introduces a quantitative measure of graph hallucination. We find that hallucination on the recitation task correlates with the Hallucination Leaderboard, suggesting its utility as a lightweight reliability metric. In contrast, random graph generation yields good and reproducible results across most models, pointing to a possible emergent capability for structural graph synthesis. This study offers a starting benchmark and an evaluation approach at the intersection of network science and LLMs.
📝 Abstract
Large Language Models (LLMs) are nowadays prompted for a wide variety of tasks. In this article, we investigate their ability to recite and generate graphs. We first study the ability of LLMs to regurgitate well-known graphs from the literature (e.g., the Karate Club graph or the Graph Atlas). Second, we question the generative capabilities of LLMs by asking for Erdős–Rényi random graphs. Although LLMs could have memorized some Erdős–Rényi graphs included in their scraped training sets, this second investigation aims at studying a possible emergent property of LLMs. For both tasks, we propose a metric to assess their errors through the lens of hallucination (i.e., incorrect information returned as fact). Most notably, we find that the amplitude of graph hallucinations can characterize the superiority of some LLMs. Indeed, for the recitation task, we observe that graph hallucinations correlate with the Hallucination Leaderboard, a hallucination ranking built from 10,000 times more prompts. For the generation task, we find surprisingly good and reproducible results for most LLMs. We believe this constitutes a starting point for more in-depth studies of this emergent capability and a challenging benchmark for improving it. Altogether, these two aspects of LLM capabilities bridge a gap between the network science and machine learning communities.
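The abstract does not spell out the metrics, so the following is only a minimal sketch of how the two tasks could be scored. It assumes the LLM answer has already been parsed into an edge list over the same node labels as the reference. The function names `edge_hallucination` and `density_z_score` are hypothetical, and neither is claimed to be the metric used in the article.

```python
from math import sqrt


def normalize(edges):
    """Turn an edge list into a set of undirected edges, dropping self-loops."""
    return {frozenset(e) for e in edges if e[0] != e[1]}


def edge_hallucination(reference_edges, returned_edges):
    """Recitation task (assumed metric): share of wrong edges.

    Counts edges that are invented or missing, divided by the size of the
    union of both edge sets. 0.0 is a perfect recitation, 1.0 means the
    two graphs share no edge. Nodes are compared by label, with no
    isomorphism matching.
    """
    ref, out = normalize(reference_edges), normalize(returned_edges)
    union = ref | out
    if not union:
        return 0.0
    return len(ref ^ out) / len(union)


def density_z_score(n, p, returned_edges):
    """Generation task (assumed check): is the edge count plausible for G(n, p)?

    In an Erdős–Rényi graph G(n, p) the edge count is binomial with
    n(n-1)/2 trials and success probability p. This returns the gap between
    the observed count and its expectation, in standard deviations.
    Requires 0 < p < 1.
    """
    pairs = n * (n - 1) // 2
    m = len(normalize(returned_edges))
    return (m - pairs * p) / sqrt(pairs * p * (1 - p))


# Toy usage: a 4-cycle as reference, and an answer with one wrong edge.
reference = [(0, 1), (1, 2), (2, 3), (3, 0)]
answer = [(0, 1), (2, 1), (2, 3), (0, 2)]
print(edge_hallucination(reference, answer))
print(density_z_score(4, 0.5, answer))
```

An edge-count test is a weak check on its own: a graph can have the right density and still be far from random, so a fuller study would also look at quantities such as the degree distribution.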