CITE: A Comprehensive Benchmark for Heterogeneous Text-Attributed Graphs on Catalytic Materials

πŸ“… 2025-08-21
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
The catalytic materials domain lacks large-scale heterogeneous text-attributed graph (TAG) benchmarks for graph learning. Method: We introduce CITE, the first open-source benchmark dataset for this domain, comprising 438K nodes, 1.2M edges, and four semantic relation types. We design a standardized evaluation protocol to systematically assess heterogeneous GNNs, homogeneous graph models, large language models (LLMs), and LLM–graph hybrid methods on node classification, accompanied by ablation studies. Contribution/Results: CITE is the first benchmark to organically integrate heterogeneous structural topology, textual semantics, and domain-specific knowledge. It fills a critical gap in catalytic materials graph learning benchmarks and reveals the synergistic impact of heterogeneity and textual information on model performance. By enabling fair, reproducible comparisons, CITE establishes a rigorous foundation for advancing methodological innovation in heterogeneous graph representation learning for catalysis.

πŸ“ Abstract
Text-attributed graphs (TAGs) are pervasive in real-world systems, where each node carries its own textual features. In many cases these graphs are inherently heterogeneous, containing multiple node types and diverse edge types. Despite the ubiquity of such heterogeneous TAGs, large-scale benchmark datasets remain scarce. This shortage has become a critical bottleneck, hindering the development and fair comparison of representation learning methods on heterogeneous text-attributed graphs. In this paper, we introduce CITE (Catalytic Information Textual Entities Graph), the first and largest heterogeneous text-attributed citation graph benchmark for catalytic materials. CITE comprises over 438K nodes and 1.2M edges, spanning four relation types. In addition, we establish standardized evaluation procedures and conduct extensive benchmarking on the node classification task, along with ablation experiments on the heterogeneous and textual properties of CITE. We compare four classes of learning paradigms: homogeneous graph models, heterogeneous graph models, LLM (large language model)-centric models, and LLM+Graph models. In a nutshell, we provide (i) an overview of the CITE dataset, (ii) standardized evaluation protocols, and (iii) baseline and ablation experiments across diverse modeling paradigms.
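As a rough illustration of the kind of data structure the abstract describes — a heterogeneous text-attributed graph with typed nodes carrying raw text, relation-keyed edge lists, and a standardized split for node classification — here is a minimal stdlib-only sketch. All class, field, and relation names below are hypothetical for illustration; this is not the released CITE loader or its actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple
import random

# Hypothetical in-memory layout for a heterogeneous text-attributed graph:
# each node type holds raw text attributes, and edges are grouped by relation type.
@dataclass
class HeteroTAG:
    # node_type -> one text attribute string per node
    node_text: Dict[str, List[str]] = field(default_factory=dict)
    # relation name -> list of (src_index, dst_index) pairs
    edges: Dict[str, List[Tuple[int, int]]] = field(default_factory=dict)
    # class labels on the target node type (the node classification task)
    labels: Dict[int, int] = field(default_factory=dict)

def split_nodes(num_nodes: int, train: float = 0.6, val: float = 0.2, seed: int = 0):
    """A seeded random train/val/test split, as a stand-in for a
    standardized evaluation protocol (exact ratios are assumptions)."""
    rng = random.Random(seed)
    idx = list(range(num_nodes))
    rng.shuffle(idx)
    n_tr, n_va = int(train * num_nodes), int(val * num_nodes)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

# Toy graph with two node types and two illustrative relation types
# (CITE itself has four relation types and 438K nodes).
g = HeteroTAG(
    node_text={
        "paper": ["Pt/CeO2 catalysts for CO oxidation", "Zeolite acidity and cracking"],
        "author": ["Jane Doe"],
    },
    edges={
        "cites": [(0, 1)],       # paper -> paper
        "written_by": [(0, 0)],  # paper -> author
    },
    labels={0: 3, 1: 7},         # topic class ids for papers
)

train_idx, val_idx, test_idx = split_nodes(10)
```

Fixing the seed makes the split reproducible across runs, which is the property a benchmark protocol needs for fair model comparison.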
Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale heterogeneous text-attributed graph (TAG) benchmarks
Difficulty of developing and fairly comparing representation learning methods without such benchmarks
No standardized evaluation protocol for catalytic materials citation data
Innovation

Methods, ideas, or system contributions that make the work stand out.

First and largest heterogeneous text-attributed graph benchmark for catalytic materials (438K nodes, 1.2M edges, four relation types)
Standardized evaluation protocols for node classification, with ablations on heterogeneity and textual features
Comparison across four learning paradigms, including LLM-centric and LLM+Graph models
Chenghao Zhang
Renmin University of China
Natural Language Processing, Information Retrieval, Multimodal
Qingqing Long
Computer Network Information Center, Chinese Academy of Sciences
Ludi Wang
Computer Network Information Center, Chinese Academy of Sciences
Wenjuan Cui
Computer Network Information Center, Chinese Academy of Sciences
Jianjun Yu
Computer Network Information Center, Chinese Academy of Sciences
Yi Du
Chinese Academy of Sciences
Data Mining, Knowledge Engineering, AI for Science