🤖 AI Summary
This work addresses a limitation of static routing in conventional Mixture-of-Experts (MoE) architectures for graph neural networks: every node receives the same expert budget regardless of how hard it is to classify, which leads to underfitting on hard nodes and redundant computation on easy ones. The authors propose D2MoE, a framework that makes node difficulty an explicit input to MoE routing. It uses predictive entropy as a real-time proxy for node difficulty and applies a difficulty-aware top-p sparse routing strategy, so that the number of experts assigned to each node scales continuously with its difficulty. Evaluated on 13 benchmark datasets, D2MoE achieves state-of-the-art performance, with accuracy improvements of up to 7.92% on heterophilous graphs. On large-scale graphs it reduces memory consumption by up to 73.07% and training time by 46.53% relative to the best-performing Graph MoE baseline.
📝 Abstract
Mixture-of-Experts (MoE) architectures offer a scalable path for Graph Neural Networks (GNNs) in node classification tasks but typically rely on static and rigid routing strategies that enforce a uniform expert budget or coarse-grained expert toggles on all nodes. This limitation overlooks the varying discriminative difficulty of nodes and leads to under-fitting for hard nodes and redundant computation for easy ones. To resolve this issue, we propose D2MoE, a novel framework that shifts the focus from static expert selection to node-wise expert resource allocation. By using predictive entropy as a real-time proxy for difficulty, D2MoE employs a difficulty-driven top-p routing mechanism to adaptively concentrate expert resources on hard nodes while reducing overhead for easy ones, achieving continuous and fine-grained expert budget scaling for node classification. Experiments on 13 benchmarks demonstrate that D2MoE achieves consistent state-of-the-art performance, surpassing leading baselines by up to 7.92% in accuracy on heterophilous graphs. Notably, on large-scale graphs, it reduces memory consumption by up to 73.07% and training time by 46.53% compared to the best-performing Graph MoE, thereby validating its superior efficiency.
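The routing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how entropy is mapped to the top-p threshold, so the linear mapping below, the bounds `p_min` and `p_max`, and all function names are assumptions made for illustration. The sketch computes a normalized predictive entropy per node, turns it into a threshold p, and keeps the smallest set of experts whose cumulative gate probability reaches p.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(num_classes), so the result lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def top_p_route(gate_probs, p):
    """Keep the smallest set of experts whose cumulative gate probability reaches p.

    Returns a dict {expert_index: renormalized_weight}.
    """
    order = sorted(range(len(gate_probs)), key=lambda i: gate_probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += gate_probs[i]
        if cumulative >= p:
            break
    return {i: gate_probs[i] / cumulative for i in chosen}


def difficulty_aware_route(class_logits, gate_logits, p_min=0.3, p_max=0.9):
    """Hypothetical mapping (an assumption, not from the paper):
    p grows linearly with the node's normalized predictive entropy."""
    difficulty = normalized_entropy(softmax(class_logits))
    p = p_min + (p_max - p_min) * difficulty
    return difficulty, top_p_route(softmax(gate_logits), p)


# Same gate scores for both nodes; only the prediction confidence differs.
gate_logits = [2.0, 1.0, 0.5, 0.0]
easy_difficulty, easy_experts = difficulty_aware_route([8.0, 0.0, 0.0], gate_logits)
hard_difficulty, hard_experts = difficulty_aware_route([0.1, 0.0, 0.05], gate_logits)
print(len(easy_experts), len(hard_experts))
```

With these inputs the confident (low-entropy) node is routed to fewer experts than the uncertain (high-entropy) node, which is the budget-scaling behavior the abstract describes. Unlike top-k routing, where every node activates exactly k experts, the threshold p lets the expert count vary per node.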