🤖 AI Summary
This study addresses the obstacles that existing graph anomaly detection (GAD) methods face in real-world deployment: large-scale graphs, extremely low anomaly prevalence (e.g., 0.1%), and missing node attributes. To narrow the gap between laboratory evaluation and practical use, the work presents a multi-dimensional benchmark built from five diverse graph datasets, including two industrial-scale graphs with over 3.7 million nodes, and controlled variants derived from them. The authors evaluate nine representative GAD models, including GNN-based and reconstruction-based approaches, under these three challenges. Results show that most GNN-based methods cannot scale to million-node graphs because of prohibitive memory requirements, that detection performance drops sharply at realistic anomaly ratios and often reaches zero recall, and that reconstruction-based models are highly sensitive to the attribute imputation strategy. Strong results in idealized experiments therefore do not guarantee effectiveness in production.
📝 Abstract
Graph Anomaly Detection (GAD) is a critical task in graph machine learning with vital applications in financial fraud detection and social platform governance. However, existing GAD benchmarks are often restricted to small-scale, curated graphs with relatively balanced anomaly ratios, leaving a substantial gap between academic evaluation and real-world deployment. To bridge this gap, we present a multi-dimensional benchmark that systematically evaluates GAD models under three deployment-relevant challenges: million-scale graphs, extreme anomaly scarcity, and missing node attributes. We derive a family of controlled benchmark variants from five diverse graphs, including two native industrial-scale datasets with over 3.7 million nodes. Our extensive evaluation of nine representative GAD models reveals three major limitations: (1) most GNN-based methods fail to scale to million-node graphs due to prohibitive memory requirements; (2) detection performance drops sharply under realistic anomaly ratios (e.g., 0.1%), often resulting in zero recall; and (3) reconstruction-based models are highly sensitive to attribute imputation strategies. Our findings suggest that strong performance in laboratory settings does not guarantee robustness in production environments. We release this benchmark and empirical evaluation as a diagnostic testbed to promote the development of robust and scalable GAD systems for large-scale, imperfect graphs encountered in practice. Code is available at https://anonymous.4open.science/r/Benchmark_GAD-E7A3.
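The abstract does not say how the controlled variants are built, so the following is only an illustrative sketch of how anomaly scarcity and missing attributes might be simulated on a labeled graph. The function names `subsample_anomalies` and `mask_and_mean_impute`, the binary label convention (1 = anomaly), and the choice of mean imputation are all assumptions made for illustration. They are not taken from the released benchmark code.

```python
import numpy as np


def subsample_anomalies(labels, target_ratio, rng):
    """Hypothetical helper: keep every normal node and a random subset of
    anomalies so that anomalies make up about `target_ratio` of kept nodes."""
    labels = np.asarray(labels)
    anomaly_idx = np.flatnonzero(labels == 1)
    normal_idx = np.flatnonzero(labels == 0)
    # Solve k / (k + n_normal) = target_ratio for the anomaly count k.
    k = int(round(target_ratio * normal_idx.size / (1.0 - target_ratio)))
    k = min(k, anomaly_idx.size)  # cannot keep more anomalies than exist
    keep = np.zeros(labels.size, dtype=bool)
    keep[normal_idx] = True
    keep[rng.choice(anomaly_idx, size=k, replace=False)] = True
    # Edges incident to dropped nodes would also need to be removed.
    return keep


def mask_and_mean_impute(features, missing_rate, rng):
    """Hypothetical helper: mark a random fraction of nodes as attribute-less
    and fill their rows with the mean of the observed rows (one possible
    imputation strategy among many)."""
    features = np.asarray(features, dtype=float)
    missing = rng.random(features.shape[0]) < missing_rate
    imputed = features.copy()
    if missing.any() and (~missing).any():
        imputed[missing] = features[~missing].mean(axis=0)
    return imputed, missing


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = np.zeros(10_000, dtype=int)
    labels[:500] = 1  # toy graph with a 5% anomaly ratio
    keep = subsample_anomalies(labels, target_ratio=0.001, rng=rng)
    print("kept nodes:", keep.sum(), "kept anomalies:", labels[keep].sum())
```

At a 0.1% ratio a graph of roughly ten thousand nodes retains only about ten anomalies, which illustrates how a detector with a fixed decision threshold can miss all of them and report zero recall.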