DDFI: Diverse and Distribution-aware Missing Feature Imputation via Two-step Reconstruction

📅 2025-12-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
In real-world graph data, node features are often missing due to privacy constraints, which severely degrades GNN performance. Existing feature propagation (FP) methods struggle with graphs that are not fully connected, suffer from over-smoothing, and are limited to transductive learning, ignoring feature distribution shift in inductive settings. To address these challenges, we propose DDFI, a two-stage feature imputation framework integrating FP with a graph-structured masked autoencoder (MAE). Its key contributions are: (1) Co-Label Linking, a novel algorithm enabling cross-component feature propagation on weakly connected graphs; (2) a two-step reconstruction mechanism jointly enforcing local consistency and global distribution alignment to mitigate inductive shift and enhance feature diversity; and (3) Sailing, a realistic benchmark dataset featuring naturally missing node features. Extensive experiments across six public benchmarks and Sailing demonstrate that DDFI consistently outperforms state-of-the-art methods under both transductive and inductive settings, significantly improving imputation quality and model robustness.
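The Co-Label Linking idea in contribution (1) can be sketched in a few lines: randomly add edges between training nodes that share a label, so feature propagation can cross otherwise disconnected components. The sampling scheme and number of links per class below are assumptions for illustration; the paper's exact policy may differ.

```python
import random
from collections import defaultdict

def co_label_linking(train_labels, num_links_per_class=1, seed=0):
    """Sketch of Co-Label Linking (CLL): randomly connect training
    nodes with the same label to bridge disconnected components.
    `train_labels` maps node id -> label for labeled training nodes."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for node, label in train_labels.items():
        by_label[label].append(node)
    new_edges = []
    for label, nodes in by_label.items():
        if len(nodes) < 2:
            continue  # cannot link a singleton class
        for _ in range(num_links_per_class):
            u, v = rng.sample(nodes, 2)  # two distinct same-label nodes
            new_edges.append((u, v))
    return new_edges
```

The returned edges would be added to the graph before running feature propagation, after which they can be discarded.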

📝 Abstract
Incomplete node features are ubiquitous in real-world scenarios, e.g., the attributes of web users may be partly private, which causes the performance of Graph Neural Networks (GNNs) to decline significantly. Feature propagation (FP) is a well-known method that performs well for imputation of missing node features on graphs, but it still has three issues: 1) it struggles with graphs that are not fully connected, 2) imputed features face the over-smoothing problem, and 3) FP is tailored for transductive tasks, overlooking the feature distribution shift in inductive tasks. To address these challenges, we introduce DDFI, a Diverse and Distribution-aware Missing Feature Imputation method that combines feature propagation with a graph-based Masked AutoEncoder (MAE) in a nontrivial manner. It first designs a simple yet effective algorithm, namely Co-Label Linking (CLL), that randomly connects nodes in the training set with the same label to enhance performance on graphs with numerous connected components. We then develop a novel two-step representation generation process at the inference stage. Specifically, instead of directly using FP-imputed features as input during inference, DDFI further reconstructs the features through the whole MAE to reduce feature distribution shift in inductive tasks and enhance the diversity of node features. Meanwhile, since existing feature imputation methods for graphs are evaluated only by simulating missing-feature scenarios through manual masking, we collect a new dataset called Sailing from voyage records that contains naturally missing features, enabling more realistic evaluation. Extensive experiments conducted on six public datasets and Sailing show that DDFI outperforms the state-of-the-art methods under both transductive and inductive settings.
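The feature propagation baseline the abstract builds on can be sketched as iterative diffusion with clamping: missing entries are filled by repeatedly averaging over neighbors while observed entries are reset to their known values each step. The dense, row-normalized propagation operator below is a simplifying assumption; practical implementations use a sparse, symmetrically normalized matrix.

```python
import numpy as np

def feature_propagation(adj, x, known_mask, num_iters=40):
    """Minimal sketch of feature propagation (FP) for missing-feature
    imputation: diffuse features over the graph, clamping observed
    entries back to their known values after every step.
    adj:        (n, n) dense adjacency matrix
    x:          (n, d) features, arbitrary values where missing
    known_mask: (n, d) boolean mask, True where the feature is observed."""
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0          # avoid division by zero for isolated nodes
    prop = adj / deg             # row-normalized propagation matrix
    h = np.where(known_mask, x, 0.0)  # initialize missing entries to zero
    for _ in range(num_iters):
        h = prop @ h                   # diffuse
        h = np.where(known_mask, x, h) # clamp known features
    return h
```

On a path graph 0-1-2 with features known at the endpoints, the middle node converges to their average, which also hints at issue 1) above: diffusion cannot reach a component that contains no observed features.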
Problem

Research questions and friction points this paper is trying to address.

Imputes missing node features in graphs
Addresses over-smoothing and distribution shift issues
Enhances performance on graphs with disconnected components
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines feature propagation with graph-based Masked AutoEncoder
Uses Co-Label Linking to connect nodes with same label
Reconstructs features through MAE to reduce distribution shift
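The two-step inference described above can be sketched as follows. Rather than feeding FP-imputed features directly to the downstream GNN, DDFI passes them through the full MAE so the output is pulled toward the feature distribution learned during training. The `encoder`/`decoder` callables here are stand-ins for the paper's graph-based MAE; their form is an assumption.

```python
import numpy as np

def ddfi_inference(x_fp, encoder, decoder):
    """Hypothetical sketch of DDFI's two-step representation generation
    at inference time:
      step 1: encode the FP-imputed features x_fp
      step 2: decode (reconstruct) them, aligning the result with the
              feature distribution the MAE saw during training."""
    z = encoder(x_fp)
    x_hat = decoder(z)
    return x_hat
```

The reconstructed `x_hat`, not `x_fp`, is what the downstream model consumes, which is how the method addresses distribution shift on inductively added nodes.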
Yifan Song
HKUST(GZ)
Fenglin Yu
Carnegie Mellon University
Yihong Luo
The Hong Kong University of Science and Technology
Generative Models · Diffusion Models · Energy-Based Models · Graph Neural Network
Xingjian Tao
HKUST(GZ)
Siya Qiu
HKUST
Kai Han
Shanghai University of Finance and Economics
Jing Tang
HKUST(GZ) & HKUST