Bridging Academia and Industry: A Comprehensive Benchmark for Attributed Graph Clustering

📅 2026-02-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the gap between academic research and industrial practice in attributed graph clustering (AGC), where existing evaluations rely on small-scale, highly homophilic datasets and non-scalable full-batch training that poorly reflect real-world scenarios. To bridge this divide, we propose PyAGC, the first scalable AGC benchmark platform designed for industrial deployment. PyAGC unifies prominent methods under a modular Encode-Cluster-Optimize framework, enabling memory-efficient mini-batch training and distributed scalability while supporting complex tabular features and heterogeneous graph structures. We introduce a benchmark suite of 12 datasets (ranging from 2.7K to 111M nodes), including low-homophily industrial graphs, and establish a comprehensive evaluation protocol combining unsupervised structural metrics with efficiency analysis. The effectiveness of PyAGC is validated in high-stakes applications at Ant Group. The platform is open-sourced to advance reproducible, scalable, and practical AGC research.

πŸ“ Abstract
Attributed Graph Clustering (AGC) is a fundamental unsupervised task that integrates structural topology and node attributes to uncover latent patterns in graph-structured data. Despite its importance in industrial applications such as fraud detection and user segmentation, a significant chasm persists between academic research and real-world deployment. Current evaluation protocols suffer from small-scale, high-homophily citation datasets, non-scalable full-batch training paradigms, and a reliance on supervised metrics that fail to reflect performance in label-scarce environments. To bridge these gaps, we present PyAGC, a comprehensive, production-ready benchmark and library designed to stress-test AGC methods across diverse scales and structural properties. We unify existing methodologies into a modular Encode-Cluster-Optimize framework and, for the first time, provide memory-efficient, mini-batch implementations for a wide array of state-of-the-art AGC algorithms. Our benchmark curates 12 diverse datasets, ranging from 2.7K to 111M nodes, specifically incorporating industrial graphs with complex tabular features and low homophily. Furthermore, we advocate for a holistic evaluation protocol that mandates unsupervised structural metrics and efficiency profiling alongside traditional supervised metrics. Battle-tested in high-stakes industrial workflows at Ant Group, this benchmark offers the community a robust, reproducible, and scalable platform to advance AGC research towards realistic deployment. The code and resources are publicly available via GitHub (https://github.com/Cloudy1225/PyAGC), PyPI (https://pypi.org/project/pyagc), and Documentation (https://pyagc.readthedocs.io).
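The Encode-Cluster-Optimize decomposition described in the abstract can be sketched in miniature: encode node attributes into embeddings using graph structure (here, in mini-batches over nodes), then cluster the embeddings. The function names, the neighbour-mean encoder, and the toy graph below are illustrative assumptions for exposition only, not PyAGC's actual API.

```python
import numpy as np

def encode(X, adj, batch_size=64):
    """'Encode' stage (toy version): each node's embedding is the mean of
    its own features and its neighbours' features, computed over
    mini-batches of nodes rather than the full graph at once."""
    n = X.shape[0]
    Z = np.empty_like(X, dtype=float)
    for start in range(0, n, batch_size):  # mini-batch over node indices
        for i in range(start, min(start + batch_size, n)):
            neigh = np.flatnonzero(adj[i])
            Z[i] = X[np.append(neigh, i)].mean(axis=0)
    return Z

def cluster(Z, k, iters=20):
    """'Cluster' stage: plain k-means on the embeddings, with a
    deterministic farthest-first initialisation."""
    centers = [Z[0]]
    for _ in range(k - 1):
        d = np.min([((Z - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):  # Lloyd iterations
        labels = np.argmin(((Z[:, None] - centers) ** 2).sum(-1), axis=1)
        for c in range(k):
            mask = labels == c
            if mask.any():
                centers[c] = Z[mask].mean(axis=0)
    return labels

# Toy graph: two 3-node cliques with well-separated 1-D features.
adj = np.zeros((6, 6), dtype=int)
adj[:3, :3] = 1
adj[3:, 3:] = 1
np.fill_diagonal(adj, 0)
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])

labels = cluster(encode(X, adj), k=2)  # recovers the two cliques
```

In PyAGC's framing, a third 'Optimize' stage would refine the encoder against a clustering objective; this sketch omits it and only shows why mini-batching the encoder keeps peak memory bounded by the batch, not the graph.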
Problem

Research questions and friction points this paper is trying to address.

Attributed Graph Clustering
benchmark
industrial deployment
evaluation protocol
unsupervised learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attributed Graph Clustering
Mini-batch Training
Low Homophily
Industrial Benchmark
Unsupervised Evaluation