When Graph Language Models Go Beyond Memorization

📅 2026-05-07
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study asks whether graph language models (GLMs) learn structural regularities or merely memorize their training graphs. To separate memorization from structural alignment, the authors propose a calibrated diagnostic protocol that combines frequent subgraph mining, a graph-level bootstrap baseline, and a three-tier frequency stratification. On five TU benchmarks, model alignment is matched or exceeded by the memorization bootstrap in most cases, so small-scale fidelity is largely indistinguishable from verbatim recall. At large scale (3.75 million graphs), verbatim memorization drops sharply while the Spearman rank correlation of subgraph frequencies stays near ceiling, and the correlation computed on novel-only generations closely tracks the all-generation value. High-frequency subgraphs are consistently reproduced, whereas rare patterns remain poorly covered, with only marginal improvement as capacity grows. The findings indicate that GLMs trained at scale can acquire structural regularities beyond memorization, primarily in the high-frequency regime.
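The three-tier frequency stratification and the per-tier coverage comparison mentioned above can be illustrated with a minimal sketch. It assumes that subgraph mining has already produced a support count per pattern, that patterns are identified by canonical strings, and that tiers are formed by rank tertiles; the function names and the tertile rule are hypothetical choices for illustration, and the paper's actual thresholds are not given here.

```python
def stratify_by_frequency(train_support):
    """Split patterns into high/mid/low tiers by rank tertile.

    train_support maps a pattern identifier to its support count in the
    training set. The tertile rule is an assumption made for this sketch.
    """
    # Sort by descending support; break ties by identifier for determinism.
    ranked = sorted(train_support, key=lambda p: (-train_support[p], p))
    n = len(ranked)
    cut1, cut2 = n // 3, 2 * n // 3
    return {
        "high": ranked[:cut1],
        "mid": ranked[cut1:cut2],
        "low": ranked[cut2:],
    }


def tier_coverage(tiers, generated_patterns):
    """Fraction of each tier's patterns that occur at least once in the generations."""
    return {
        name: (sum(p in generated_patterns for p in pats) / len(pats)) if pats else 0.0
        for name, pats in tiers.items()
    }


# Toy example with made-up support counts.
train_support = {"A": 90, "B": 80, "C": 40, "D": 30, "E": 2, "F": 1}
generated_patterns = {"A", "B", "C", "F"}

tiers = stratify_by_frequency(train_support)
print(tier_coverage(tiers, generated_patterns))
# {'high': 1.0, 'mid': 0.5, 'low': 0.5}
```

Reporting coverage per tier rather than as a single aggregate is what lets a deficit on rare patterns stay visible even when high-frequency patterns dominate the overall score.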
📝 Abstract
It remains unclear whether graph language models learn structural regularities or merely memorize training graphs; this cannot be resolved by current aggregate fidelity metrics alone. We develop a calibrated diagnostic protocol that combines frequent subgraph mining, a graph-level bootstrap baseline, and three-level frequency stratification to disentangle memorization from structural alignment. Using this framework, we show that graph language models can acquire structural regularities beyond memorization at scale, primarily in the high-frequency regime. This is supported by the following empirical evidence: On five TU benchmarks, LLaMA-style graph language models reach high subgraph-rank correlation, yet their alignment is matched or exceeded by the memorization bootstrap in most cases. At small scale, under our bootstrap diagnostic, fidelity is largely indistinguishable from verbatim recall. In contrast, at large scale with 3.75M graphs, verbatim memorization drops sharply while rank correlation remains near ceiling. Crucially, in a separate fixed-subsample analysis, frequent subgraph mining restricted to the novel-only subset closely tracks the corresponding all-generation Spearman correlation, providing evidence that the alignment is not driven solely by verbatim recall. Across all scales, high-frequency patterns are well reproduced, while rare patterns remain poorly covered, and this deficit narrows only marginally as capacity increases. We observe the same scale-dependent crossover under two distinct graph serializations (canonical DFS code and action sequences), providing evidence of robustness in our analysis.
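The diagnostic described in the abstract compares a model's subgraph-rank correlation against what resampling the training set alone would achieve, and repeats the measurement on novel-only generations. The sketch below is one way such a comparison could look, not the authors' implementation. It assumes each graph is a hypothetical pair `(canonical_code, patterns)`, where `patterns` is the set of mined subgraph identifiers it contains, and that verbatim recall means an exact match on `canonical_code`. All function names and parameters are illustrative.

```python
import random
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else float("nan")


def support(graphs, patterns):
    """Number of graphs containing each pattern, in the order of `patterns`."""
    counts = Counter(p for _, pats in graphs for p in pats)
    return [counts[p] for p in patterns]


def bootstrap_baseline(train, patterns, n_rounds=200, seed=0):
    """Mean Spearman correlation between the training set and resamples of itself.

    This stands in for a pure-memorization reference: a generator that only
    replays training graphs would score about this well.
    """
    rng = random.Random(seed)
    ref = support(train, patterns)
    scores = []
    for _ in range(n_rounds):
        resample = rng.choices(train, k=len(train))  # sample with replacement
        score = spearman(ref, support(resample, patterns))
        if score == score:  # skip NaN from degenerate resamples
            scores.append(score)
    return sum(scores) / len(scores) if scores else float("nan")


def novel_only(train, generated):
    """Keep generated graphs whose canonical code never appears in training."""
    seen = {code for code, _ in train}
    return [g for g in generated if g[0] not in seen]


def verbatim_rate(train, generated):
    """Fraction of generated graphs that are exact copies of a training graph."""
    return 1 - len(novel_only(train, generated)) / len(generated)
```

Under these assumptions, a model's all-generation score is `spearman(support(train, patterns), support(generated, patterns))`, and the novel-only score uses `novel_only(train, generated)` in place of `generated`. If the novel-only score stays close to the all-generation score while `verbatim_rate` is low, the alignment is unlikely to be driven by verbatim recall alone.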
Problem

Research questions and friction points this paper is trying to address.

graph language models
memorization
structural regularities
fidelity metrics
generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

graph language models
structural generalization
memorization vs. learning
subgraph mining
frequency stratification