🤖 AI Summary
This work addresses the heavy reliance of instruction tuning for large language models (LLMs) on extensive expert annotations over text-attributed graphs, where labeling sensitive or rapidly evolving content is costly and slow, and where existing approaches fail to exploit the structural and semantic information carried by unlabeled nodes. To this end, we propose SIT-Graph, the first model-agnostic semi-supervised framework tailored to graph-based instruction tuning. SIT-Graph runs an iterative self-training loop: a model fine-tuned on the initial labeled data generates high-confidence pseudo-labels for unlabeled nodes, which progressively expand the training set. By bringing semi-supervised self-training into graph instruction tuning, our method substantially reduces dependence on labeled data and achieves gains exceeding 20% over state-of-the-art graph instruction tuning approaches across multiple benchmarks in low-label regimes.
📝 Abstract
The emergent reasoning capabilities of Large Language Models (LLMs) offer a transformative paradigm for analyzing text-attributed graphs. While instruction tuning is the prevailing method for adapting pre-trained LLMs to graph learning tasks such as node classification, it requires a substantial volume of annotated (INSTRUCTION, OUTPUT) pairs derived from labeled nodes. This requirement is particularly prohibitive in the social domain, where obtaining expert labels for sensitive or evolving content is costly and slow. Furthermore, standard graph instruction tuning fails to exploit the vast number of unlabeled nodes, whose edge connections encode latent correlations that are beneficial for downstream predictions. To bridge this gap, we propose a novel Semi-supervised Instruction Tuning pipeline for Graph Learning, named SIT-Graph. Notably, SIT-Graph is model-agnostic and can be seamlessly integrated into any graph instruction tuning method that uses an LLM as the predictor. SIT-Graph operates via an iterative self-training process. Initially, the model is fine-tuned on instruction pairs constructed solely from the labeled nodes. It then generates confidence-filtered pseudo-responses for unlabeled nodes to strategically augment the dataset for the next round of fine-tuning. Over successive rounds, this iterative refinement progressively aligns the LLM with the underlying node correlations. Extensive experiments demonstrate that when incorporated into state-of-the-art graph instruction tuning methods, SIT-Graph significantly enhances their performance on text-attributed graph benchmarks, achieving over 20% improvement under low label-ratio settings.
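The self-training loop described above can be sketched in a few lines of generic code. Note this is a minimal illustration, not the paper's implementation: the `ToyModel` predictor, its `fit`/`predict` interface, and the 0.9 confidence threshold are all assumptions standing in for an LLM-based graph instruction-tuning method and its actual hyperparameters.

```python
def self_train(model, labeled, unlabeled, rounds=3, threshold=0.9):
    """Iteratively grow the training set with confident pseudo-labels.

    labeled:   list of (instruction, output) pairs built from labeled nodes
    unlabeled: list of instructions built from unlabeled nodes
    threshold: illustrative confidence cutoff for accepting a pseudo-response
    """
    train_set = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        model.fit(train_set)                      # fine-tune on current pairs
        remaining = []
        for instruction in pool:
            output, confidence = model.predict(instruction)
            if confidence >= threshold:           # confidence filter
                train_set.append((instruction, output))  # pseudo-labeled pair
            else:
                remaining.append(instruction)     # retry in a later round
        pool = remaining
    return model, train_set


class ToyModel:
    """Stand-in predictor: memorizes pairs and matches on the first word."""
    def __init__(self):
        self.known = {}

    def fit(self, pairs):
        # "Fine-tuning" here is just memorization, for illustration only.
        self.known = dict(pairs)

    def predict(self, instruction):
        # Confident when a training instruction shares the leading word.
        head = instruction.split()[0]
        for known_instr, out in self.known.items():
            if known_instr.split()[0] == head:
                return out, 0.95
        return "unknown", 0.2
```

With one labeled pair `("classify node A", "cs.AI")`, the loop pseudo-labels `"classify node B"` in the first round while a dissimilar instruction stays in the unlabeled pool, mirroring how confidence filtering only admits predictions the current model supports.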