GraphSculptor: Sculpting Pre-training Coreset for Graph Self-supervised Learning

📅 2026-05-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Graph self-supervised learning relies on large-scale unlabeled data, which makes pre-training computationally expensive even though much of that data is redundant. This work proposes GraphSculptor, a label-free coreset construction method for graph pre-training that combines intrinsic structural diversity with contextual semantics from a language model. It computes a structural feature vector from graph statistics, encodes graph-to-text descriptions with a pre-trained language model, merges both signals into a unified metric space, and applies cluster-aware selection. With only 10% of the original data, the method attains 99.6% of full-data performance while reducing pre-training time by nearly 90%. The authors also derive a theoretical bound on the loss gap between coreset and full-data pre-training, which motivates the selection formulation.
📝 Abstract
Graph self-supervised learning typically relies on large-scale unlabeled datasets, heavily inflating computational costs. However, empirical evidence suggests that these datasets contain substantial redundancy: our analysis reveals that uniformly subsampling 50% of graphs retains over 96% of downstream performance. To exploit this redundancy, we introduce GraphSculptor for pre-training coreset construction. Unlike methods dependent on additional training-time signals or limited solely to topological statistics, GraphSculptor provides a label-free solution that constructs coresets via two complementary perspectives: intrinsic structure and contextual semantics. Concretely, structural diversity is quantified using intrinsic graph statistics, yielding a structural feature vector for each graph, while semantic diversity is captured by utilizing a pre-trained language model to encode descriptions generated via graph-to-text. GraphSculptor integrates these signals into a unified metric space and performs cluster-aware selection to preserve joint structural-semantic diversity. We further derive a theoretical bound on the loss gap between coreset and full-data pre-training, offering theoretical motivation for our selection formulation. Extensive experiments demonstrate that GraphSculptor effectively sculpts the dataset: a 10% coreset achieves 99.6% of full-data performance while reducing pre-training time by nearly 90%, offering a scalable solution for data-efficient graph pre-training.
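The selection pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which graph statistics, clustering algorithm, or per-cluster sampling rule GraphSculptor uses. The sketch assumes five simple statistics (node count, edge count, density, mean degree, degree standard deviation), z-score normalization followed by concatenation as the "unified metric space", plain k-means, and a proportional per-cluster quota filled with the graphs nearest each centroid. The semantic embeddings are toy vectors standing in for language-model encodings of graph-to-text descriptions. All function names are hypothetical.

```python
import math
import random


def structural_features(edges, num_nodes):
    """Assumed descriptor: [nodes, edges, density, mean degree, degree std]."""
    deg = [0] * num_nodes
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    m = len(edges)
    mean_deg = sum(deg) / num_nodes
    std_deg = math.sqrt(sum((d - mean_deg) ** 2 for d in deg) / num_nodes)
    density = 2 * m / (num_nodes * (num_nodes - 1)) if num_nodes > 1 else 0.0
    return [float(num_nodes), float(m), density, mean_deg, std_deg]


def standardize(rows):
    """Z-score each column so both feature groups share a comparable scale."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - mu) ** 2 for x in c) / len(c)) or 1.0
            for c, mu in zip(cols, means)]
    return [[(x - mu) / s for x, mu, s in zip(r, means, stds)] for r in rows]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; an assumed stand-in for the clustering step."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def select_coreset(points, ratio, k, seed=0):
    """Keep roughly `ratio` of each cluster, preferring points near its centroid."""
    labels, centers = kmeans(points, k, seed=seed)
    chosen = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if not members:
            continue
        quota = max(1, round(ratio * len(members)))
        members.sort(key=lambda i: sq_dist(points[i], centers[c]))
        chosen.extend(members[:quota])
    return sorted(chosen)


def path_graph(n):
    return [(i, i + 1) for i in range(n - 1)], n


def complete_graph(n):
    return [(i, j) for i in range(n) for j in range(i + 1, n)], n


# Toy dataset: 10 sparse path graphs and 10 dense complete graphs.
graphs = [path_graph(n) for n in range(5, 15)] + \
         [complete_graph(n) for n in range(5, 15)]
# Toy vectors standing in for language-model embeddings of graph descriptions.
semantic = [[1.0, 0.0]] * 10 + [[0.0, 1.0]] * 10

struct = [structural_features(e, n) for e, n in graphs]
joint = [s + t for s, t in zip(standardize(struct), standardize(semantic))]
coreset = select_coreset(joint, ratio=0.2, k=2)
print(coreset)  # indices of the graphs kept for pre-training
```

Because each cluster's quota is rounded separately, the coreset size only approximates `ratio` times the dataset size. A real pipeline would also need to decide how to weight structural against semantic dimensions, which this sketch sidesteps by treating every standardized dimension equally.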
Problem

Research questions and friction points this paper is trying to address.

graph self-supervised learning
pre-training coreset
data redundancy
computational efficiency
graph representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

graph self-supervised learning
coreset selection
structural-semantic diversity
data-efficient pre-training
graph-to-text encoding