🤖 AI Summary
This work addresses three critical flaws in GNN evaluation for heterophilic graph learning: (1) suboptimal hyperparameter configurations, (2) insufficient coverage of truly challenging heterophilic datasets, and (3) lack of quantitative validation for homophily metrics. We propose a novel trichotomous categorization of heterophilic datasets ("malignant," "benign," and "ambiguous") and conduct the first quantitative robustness analysis of 11 homophily metrics on synthetic graphs. Leveraging 27 benchmark datasets, we perform exhaustive hyperparameter tuning and grouped comparative evaluations of 11 state-of-the-art GNNs spanning six popular method families. Experimental results reveal that most SOTA methods suffer substantial performance degradation on malignant heterophilic graphs; moreover, several widely adopted homophily metrics exhibit poor stability under controlled synthetic settings, undermining their reliability for evaluation. This study establishes a more rigorous, reproducible evaluation paradigm and benchmark suite for heterophilic graph learning.
📝 Abstract
Over the past decade, Graph Neural Networks (GNNs) have achieved great success on machine learning tasks with relational data. However, recent studies have found that heterophily can cause significant performance degradation of GNNs, especially on node-level tasks. Numerous heterophilic benchmark datasets have been put forward to validate the efficacy of heterophily-specific GNNs, and various homophily metrics have been designed to help recognize these challenging datasets. Nevertheless, there still exist multiple pitfalls that severely hinder the proper evaluation of new models and metrics: 1) lack of hyperparameter tuning; 2) insufficient evaluation on the truly challenging heterophilic datasets; 3) missing quantitative evaluation for homophily metrics on synthetic graphs. To overcome these challenges, we first train and fine-tune baseline models on the $27$ most widely used benchmark datasets, and categorize them into three distinct groups: malignant, benign, and ambiguous heterophilic datasets. We identify malignant and ambiguous heterophily as the truly challenging subsets of tasks, and to the best of our knowledge, we are the first to propose such a taxonomy. Then, we re-evaluate $11$ state-of-the-art (SOTA) GNNs, covering six popular methods, with fine-tuned hyperparameters on the different groups of heterophilic datasets. Based on the model performance, we comprehensively reassess the effectiveness of the different methods on heterophily. Finally, we evaluate $11$ popular homophily metrics on synthetic graphs produced by three different graph generation approaches. To overcome the unreliability of observation-based comparison and evaluation, we conduct the first quantitative evaluation and provide a detailed analysis.
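To make the abstract's central object concrete: homophily metrics score how often connected nodes share a label, with low values signaling the heterophilic datasets the paper studies. The sketch below implements two of the most common variants, edge homophily and node homophily, on a toy undirected graph. It is an illustrative minimal example only; the function names and the toy graph are assumptions, not taken from the paper, which evaluates $11$ such metrics.

```python
from collections import defaultdict

def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share the same label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

def node_homophily(edges, labels):
    """Average, over nodes, of the fraction of same-label neighbors."""
    nbrs = defaultdict(list)
    for u, v in edges:          # treat the graph as undirected
        nbrs[u].append(v)
        nbrs[v].append(u)
    ratios = [
        sum(labels[n] == labels[v] for n in ns) / len(ns)
        for v, ns in nbrs.items()
    ]
    return sum(ratios) / len(ratios)

# Toy 4-node path graph with two classes and one cross-class edge.
edges = [(0, 1), (1, 2), (2, 3)]
labels = {0: 0, 1: 0, 2: 1, 3: 1}
print(edge_homophily(edges, labels))  # 2 of 3 edges are intra-class
print(node_homophily(edges, labels))  # 0.75
```

A graph like this with edge homophily near 1 is strongly homophilic; values well below the chance level for the label distribution indicate heterophily, which is where the paper observes most SOTA GNNs degrading.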