AI Summary
This work investigates label memorization in graph neural networks (GNNs) for semi-supervised node classification. We first reveal a pronounced negative correlation between graph homophily and label memorization: lower homophily increases label inconsistency among neighbors, which drives GNNs to overfit and memorize training labels. To quantify this phenomenon, we propose NCMemo, a node-level metric that measures the memorization strength of individual nodes. Further analysis uncovers an implicit structural bias in GNN training dynamics that exacerbates memorization on low-homophily graphs. Guided by these insights, we design a lightweight graph rewiring strategy that effectively mitigates memorization without compromising classification accuracy, thereby reducing vulnerability to membership inference and related privacy attacks. Our work establishes the first systematic link between homophily and memorization, offering new perspectives on GNN generalization and the privacy risks inherent in GNN training.
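The paper defines NCMemo precisely; as a rough, hedged illustration only, a per-node memorization score in the spirit of Feldman and Zhang (2020) compares a model's confidence in a node's true label when that node is included in versus held out of the training set. The sketch below is an assumption-laden illustration, not the paper's implementation; `train_fn` and the subsampling scheme are hypothetical placeholders.

```python
import numpy as np

def memorization_scores(train_fn, graph, train_nodes, labels, n_runs=10, rng=None):
    """Illustrative per-node memorization score (Feldman & Zhang style):
    difference in the model's probability of the true label when the node
    is in the training set vs. held out, averaged over random subsamples.
    `train_fn(graph, node_ids)` is a hypothetical callable that trains a GNN
    on the given labelled nodes and returns predicted class probabilities
    for all nodes (shape [num_nodes, num_classes])."""
    rng = rng or np.random.default_rng(0)
    train_nodes = np.asarray(train_nodes)
    in_conf = {v: [] for v in train_nodes}
    out_conf = {v: [] for v in train_nodes}

    for _ in range(n_runs):
        # A random half of the labelled nodes is used for training in this run.
        subset = rng.choice(train_nodes, size=len(train_nodes) // 2, replace=False)
        probs = train_fn(graph, subset)  # [num_nodes, num_classes]
        subset_set = set(subset.tolist())
        for v in train_nodes:
            conf = probs[v, labels[v]]
            (in_conf if v in subset_set else out_conf)[v].append(conf)

    # Memorization: how strongly true-label confidence depends on inclusion.
    return {v: float(np.mean(in_conf[v]) - np.mean(out_conf[v]))
            for v in train_nodes if in_conf[v] and out_conf[v]}
```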
Abstract
Deep neural networks (DNNs) have been shown to memorize their training data, yet similar analyses for graph neural networks (GNNs) remain largely unexplored. We introduce NCMemo (Node Classification Memorization), the first framework to quantify label memorization in semi-supervised node classification. We first establish an inverse relationship between memorization and graph homophily, i.e., the property that connected nodes share similar labels/features. We find that lower homophily significantly increases memorization, indicating that GNNs rely on memorization to learn less homophilic graphs. Second, we analyze GNN training dynamics and find that the increased memorization in low-homophily graphs is tightly coupled to the GNNs' implicit bias toward using graph structure during learning. In low-homophily regimes, this structure is less informative, which induces memorization of node labels to minimize the training loss. Finally, we show that nodes with higher label inconsistency in their feature-space neighborhood are significantly more prone to memorization. Building on these insights into the link between graph homophily and memorization, we investigate graph rewiring as a means to mitigate memorization. Our results demonstrate that this approach effectively reduces memorization without compromising model performance. Moreover, it lowers the privacy risk for previously memorized data points in practice. Thus, our work not only advances the understanding of GNN learning but also supports more privacy-preserving GNN deployment.
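For reference, a common way to quantify the homophily referenced above is edge homophily: the fraction of edges whose endpoints share a label. The snippet below is a minimal sketch under the assumption that the graph is given as a list of (u, v) index pairs; it is not tied to the paper's datasets or code.

```python
def edge_homophily(edges, labels):
    """Fraction of edges connecting nodes with the same label.
    `edges` is an iterable of (u, v) node-index pairs; `labels` maps
    node index -> class label. Values near 1 indicate a homophilic graph;
    low values indicate the heterophilic regime where, per the paper's
    findings, GNNs rely more heavily on label memorization."""
    edges = list(edges)
    if not edges:
        return float("nan")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


# Example: a 4-node path graph with labels [0, 0, 1, 1].
# Edges (0,1) and (2,3) are homophilic, (1,2) is not -> homophily = 2/3.
print(edge_homophily([(0, 1), (1, 2), (2, 3)], [0, 0, 1, 1]))
```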