🤖 AI Summary
This work addresses node features that are not missing completely at random (non-MCAR), a common yet underexplored issue in real-world graph neural network (GNN) applications. Existing studies are mostly evaluated on benchmarks with high-dimensional sparse features and under the MCAR assumption, which limits how far their conclusions generalize. This study establishes a more realistic evaluation framework by introducing one synthetic and three real-world datasets with dense, semantically meaningful features, together with evaluation protocols for non-MCAR missingness mechanisms. Building on missing data theory, the authors propose GNNmim, a simple baseline for node classification with incomplete features. Experiments show that GNNmim is competitive with specialized architectures across diverse datasets and missingness regimes.
📝 Abstract
Handling missing node features is a key challenge for deploying Graph Neural Networks (GNNs) in real-world domains such as healthcare and sensor networks. Existing studies mostly address relatively benign scenarios, namely benchmark datasets with (a) high-dimensional but sparse node features and (b) incomplete data generated under Missing Completely At Random (MCAR) mechanisms. For (a), we theoretically prove that high sparsity substantially limits the information loss caused by missingness, which makes all models appear robust and prevents a meaningful comparison of their performance. To overcome this limitation, we introduce one synthetic and three real-world datasets with dense, semantically meaningful features. For (b), we move beyond MCAR and design evaluation protocols with more realistic missingness mechanisms. Moreover, we provide the theoretical background needed to state explicit assumptions on the missingness process and analyze their implications for different methods. Building on this analysis, we propose GNNmim, a simple yet effective baseline for node classification with incomplete feature data. Experiments show that GNNmim is competitive with specialized architectures across diverse datasets and missingness regimes.
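To make points (a) and (b) concrete, the following is a minimal NumPy sketch, not the paper's protocol. It contrasts an MCAR mask with one illustrative non-MCAR mask in which larger values are more likely to be dropped, and it measures how many entries actually change when missing values are filled with zeros. The function names (`mcar_mask`, `value_dependent_mask`, `zero_fill_change`), the logistic form of the mechanism, and the `strength` parameter are all assumptions made for illustration; the abstract does not specify which mechanisms the authors use.

```python
import numpy as np


def mcar_mask(X, rate, rng):
    """Return a boolean mask (True = missing) where every entry is
    dropped independently with the same probability `rate`."""
    return rng.random(X.shape) < rate


def value_dependent_mask(X, rate, rng, strength=3.0):
    """Return a boolean mask (True = missing) where larger values within
    a feature column are more likely to be dropped.

    Hypothetical non-MCAR mechanism: the drop probability is a logistic
    function of the standardized value, rescaled so the average drop
    probability is roughly `rate`.
    """
    std = X.std(axis=0, keepdims=True)
    Z = (X - X.mean(axis=0, keepdims=True)) / np.where(std > 0, std, 1.0)
    p = 1.0 / (1.0 + np.exp(-strength * Z))
    p = np.clip(p * rate / p.mean(), 0.0, 1.0)
    return rng.random(X.shape) < p


def zero_fill_change(X, mask):
    """Fraction of all entries whose value changes when the masked
    entries are replaced by zero."""
    return float(np.mean(mask & (X != 0)))


rng = np.random.default_rng(0)
X_dense = rng.normal(size=(1000, 16))                        # dense features
X_sparse = (rng.random((1000, 16)) < 0.01).astype(float)     # about 1% non-zero

m_mcar = mcar_mask(X_dense, 0.3, rng)
m_dep = value_dependent_mask(X_dense, 0.3, rng)

# Under MCAR the missing and observed entries have similar means;
# under the value-dependent mask the missing entries are larger on average.
print("MCAR gap:", X_dense[m_mcar].mean() - X_dense[~m_mcar].mean())
print("value-dependent gap:", X_dense[m_dep].mean() - X_dense[~m_dep].mean())

# With sparse features most masked entries were already zero,
# so zero-filling changes very little of the matrix.
print("changed (dense):", zero_fill_change(X_dense, m_mcar))
print("changed (sparse):", zero_fill_change(X_sparse, mcar_mask(X_sparse, 0.3, rng)))
```

The last two lines give only an intuition for claim (a): at the same missing rate, zero-filling alters a far smaller share of a sparse matrix than of a dense one. This is not a substitute for the paper's formal result.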