CITE: A Comprehensive Benchmark for Heterogeneous Text-Attributed Graphs on Catalytic Materials

πŸ“… 2025-08-21
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
The catalytic materials domain lacks large-scale heterogeneous text-attributed graph (TAG) benchmarks for graph learning. Method: We introduce CITE, the first open-source benchmark dataset for this domain, comprising 438K nodes, 1.2M edges, and four semantic relation types. We design a standardized evaluation protocol to systematically assess heterogeneous GNNs, homogeneous graph models, large language models (LLMs), and LLM–graph hybrid methods on node classification, accompanied by ablation studies. Contribution/Results: CITE is the first benchmark to organically integrate heterogeneous structural topology, textual semantics, and domain-specific knowledge. It fills a critical gap in catalytic materials graph learning benchmarks and reveals the synergistic impact of heterogeneity and textual information on model performance. By enabling fair, reproducible comparisons, CITE establishes a rigorous foundation for advancing methodological innovation in heterogeneous graph representation learning for catalysis.

πŸ“ Abstract
Text-attributed graphs (TAGs) are pervasive in real-world systems, where each node carries its own textual features. In many cases these graphs are inherently heterogeneous, containing multiple node types and diverse edge types. Despite the ubiquity of such heterogeneous TAGs, large-scale benchmark datasets remain scarce. This shortage has become a critical bottleneck, hindering the development and fair comparison of representation learning methods on heterogeneous text-attributed graphs. In this paper, we introduce CITE (Catalytic Information Textual Entities Graph), the first and largest heterogeneous text-attributed citation graph benchmark for catalytic materials. CITE comprises over 438K nodes and 1.2M edges, spanning four relation types. In addition, we establish standardized evaluation procedures and conduct extensive benchmarking on the node classification task, along with ablation experiments on the heterogeneous and textual properties of CITE. We compare four classes of learning paradigms: homogeneous graph models, heterogeneous graph models, LLM (large language model)-centric models, and LLM+Graph models. In a nutshell, we provide (i) an overview of the CITE dataset, (ii) standardized evaluation protocols, and (iii) baseline and ablation experiments across diverse modeling paradigms.
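As a rough illustration of the kind of data structure the abstract describes — a heterogeneous text-attributed graph with typed nodes carrying raw text, relation-keyed edge lists, and a standardized split for node classification — here is a minimal stdlib-only sketch. All class, field, and relation names below are hypothetical for illustration; this is not the released CITE loader or its actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple
import random

# Hypothetical in-memory layout for a heterogeneous text-attributed graph:
# each node type holds raw text attributes, and edges are grouped by relation type.
@dataclass
class HeteroTAG:
    # node_type -> one text attribute string per node
    node_text: Dict[str, List[str]] = field(default_factory=dict)
    # relation name -> list of (src_index, dst_index) pairs
    edges: Dict[str, List[Tuple[int, int]]] = field(default_factory=dict)
    # class labels on the target node type (the node classification task)
    labels: Dict[int, int] = field(default_factory=dict)

def split_nodes(num_nodes: int, train: float = 0.6, val: float = 0.2, seed: int = 0):
    """A seeded random train/val/test split, as a stand-in for a
    standardized evaluation protocol (exact ratios are assumptions)."""
    rng = random.Random(seed)
    idx = list(range(num_nodes))
    rng.shuffle(idx)
    n_tr, n_va = int(train * num_nodes), int(val * num_nodes)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

# Toy graph with two node types and two illustrative relation types
# (CITE itself has four relation types and 438K nodes).
g = HeteroTAG(
    node_text={
        "paper": ["Pt/CeO2 catalysts for CO oxidation", "Zeolite acidity and cracking"],
        "author": ["Jane Doe"],
    },
    edges={
        "cites": [(0, 1)],       # paper -> paper
        "written_by": [(0, 0)],  # paper -> author
    },
    labels={0: 3, 1: 7},         # topic class ids for papers
)

train_idx, val_idx, test_idx = split_nodes(10)
```

Fixing the seed makes the split reproducible across runs, which is the property a benchmark protocol needs for fair model comparison.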
Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale heterogeneous text-attributed graph (TAG) benchmarks
Difficulty of developing and fairly comparing representation learning methods without such benchmarks
No standardized evaluation protocol for catalytic materials citation data
Innovation

Methods, ideas, or system contributions that make the work stand out.

First and largest heterogeneous text-attributed graph benchmark for catalytic materials (438K nodes, 1.2M edges, four relation types)
Standardized evaluation protocols for node classification, with ablations on heterogeneity and textual features
Comparison across four learning paradigms, including LLM-centric and LLM+Graph models
Chenghao Zhang
Renmin University of China
Natural Language Processing, Information Retrieval, Multimodal
Qingqing Long
Computer Network Information Center, Chinese Academy of Sciences
Ludi Wang
Computer Network Information Center, Chinese Academy of Sciences
Wenjuan Cui
Computer Network Information Center, Chinese Academy of Sciences
Jianjun Yu
Computer Network Information Center, Chinese Academy of Sciences
Yi Du
Chinese Academy of Sciences
Data Mining, Knowledge Engineering, AI for Science