🤖 AI Summary
This paper addresses the long-standing fragmentation between unsupervised graph-level anomaly detection (GLAD) and unsupervised graph-level out-of-distribution detection (GLOD), along with their inconsistent evaluation protocols. We introduce UB-GOLD, a unified benchmark comprising 35 datasets across four practical application scenarios and enabling systematic evaluation of 18 methods. UB-GOLD unifies task definitions and evaluation paradigms for GLAD and GLOD, and establishes a multi-dimensional analytical framework that assesses effectiveness, OOD sensitivity, robustness, and efficiency. Leveraging unsupervised graph representation learning, reconstruction error modeling, and statistical outlier scoring, it supports cross-scenario generalization assessment. Experiments reveal that most existing GLAD methods exhibit severely limited generalization to GLOD tasks. To foster reproducibility, we release an open-source, standardized codebase and evaluation protocol to support the development of secure and robust graph learning systems.
📝 Abstract
To build safe and reliable graph machine learning systems, unsupervised graph-level anomaly detection (GLAD) and unsupervised graph-level out-of-distribution (OOD) detection (GLOD) have received significant attention in recent years. Although these two lines of research share the same objective, they have been studied independently because of their distinct evaluation setups, creating a gap that hinders the application and evaluation of methods from one line to the other. To bridge this gap, we present a Unified Benchmark for unsupervised Graph-level OOD and anomaLy Detection (UB-GOLD), a comprehensive evaluation framework that unifies GLAD and GLOD under the concept of generalized graph-level OOD detection. Our benchmark encompasses 35 datasets spanning four practical anomaly and OOD detection scenarios, facilitating the comparison of 18 representative GLAD/GLOD methods. We conduct multi-dimensional analyses to explore the effectiveness, OOD sensitivity spectrum, robustness, and efficiency of existing methods, shedding light on their strengths and limitations. Furthermore, we provide an open-source codebase of UB-GOLD (https://github.com/UB-GOLD/UB-GOLD) to foster reproducible research, and we outline potential directions for future investigation based on our insights.