Graph sub-sampling for divide-and-conquer algorithms in large networks

📅 2024-09-11

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Scalability bottlenecks hinder community detection and core-periphery (CP) structure inference in large-scale networks. Method: The authors evaluate seven graph sub-sampling strategies within a divide-and-conquer framework and derive theoretical results for the mis-classification rate of the divide-and-conquer CP algorithm under sub-sampling. Contribution/Results: The best strategy is task-dependent. For community detection, uniform node sampling performs best among the sampling schemes, although the base algorithm run on the full network is sometimes both more accurate and faster. For CP identification there is no single winner, but schemes that sample core nodes at a higher rate consistently outperform alternatives such as random edge sampling and random walk sampling (accuracy gains of up to +37%), and the divide-and-conquer algorithm tends to be both more accurate and faster than the base algorithm on simulated and real-world networks. The work addresses a gap the authors identify in the statistics literature on graph sub-sampling and offers guidance for choosing a sub-sampling routine suited to the specific network analysis task.

📝 Abstract

As networks continue to increase in size, current methods must be capable of handling large numbers of nodes and edges in order to be practically relevant. Instead of working directly with the entire (large) network, analyzing sub-networks has become a popular approach. Due to a network's inherent inter-connectedness, however, sub-sampling is not a trivial task. While this problem has gained popularity in recent years, it has not received sufficient attention from the statistics community. In this work, we provide a thorough comparison of seven graph sub-sampling algorithms by applying them to divide-and-conquer algorithms for community structure and core-periphery (CP) structure. After discussing the various algorithms and sub-sampling routines, we derive theoretical results for the mis-classification rate of the divide-and-conquer algorithm for CP structure under various sub-sampling schemes. We then perform extensive experiments on both simulated and real-world data to compare the various methods. For the community detection task, we found that sampling nodes uniformly at random yields the best performance, but that sometimes the base algorithm applied to the entire network yields better results both in terms of identification and computational time. For CP structure on the other hand, there was no single winner, but algorithms which sampled core nodes at a higher rate consistently outperformed other sampling routines, e.g., random edge sampling and random walk sampling. Unlike community detection, the CP divide-and-conquer algorithm tends to yield better identification results while also being faster than the base algorithm. The varying performance of the sampling algorithms on different tasks demonstrates the importance of carefully selecting a sub-sampling routine for the specific application.
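
As a rough illustration of the kinds of sub-sampling routines the abstract names (uniform node sampling, random edge sampling, random walk sampling, and sampling that favours core nodes), here is a minimal Python sketch. It is not the authors' implementation. All function names are hypothetical, the restart rule in the random walk is an assumption, and node degree is used as a stand-in for "coreness" purely for illustration. Nodes are assumed to be sortable (for example, integers).

```python
import random
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets built from an edge list (hypothetical helper)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def uniform_node_sample(adj, m, rng):
    """Choose m nodes uniformly at random without replacement."""
    return set(rng.sample(sorted(adj), m))


def random_edge_sample(edges, m, rng):
    """Visit edges in random order and collect endpoints until m nodes are held."""
    pool = list(edges)
    rng.shuffle(pool)
    nodes = set()
    for u, v in pool:
        for w in (u, v):
            if len(nodes) < m:
                nodes.add(w)
        if len(nodes) >= m:
            break
    return nodes


def random_walk_sample(adj, m, rng, restart_after=100):
    """Simple random walk; jumps to a random node after restart_after
    consecutive steps without a new node (an assumed rule, so the walk
    cannot get trapped in a small component). Requires m <= number of nodes."""
    node_list = sorted(adj)
    current = rng.choice(node_list)
    visited = {current}
    stalled = 0
    while len(visited) < m:
        neighbours = sorted(adj[current])
        if not neighbours or stalled >= restart_after:
            current = rng.choice(node_list)
            stalled = 0
        else:
            current = rng.choice(neighbours)
        if current in visited:
            stalled += 1
        else:
            visited.add(current)
            stalled = 0
    return visited


def degree_biased_sample(adj, m, rng):
    """Weighted sampling without replacement, weight = degree.
    Degree is only a proxy for core membership (assumption). Uses the
    Efraimidis-Spirakis trick: key = U ** (1 / weight), keep the m largest."""
    keyed = [(rng.random() ** (1.0 / len(adj[v])), v) for v in sorted(adj)]
    keyed.sort(reverse=True)
    return {v for _, v in keyed[:m]}


def induced_edges(adj, nodes):
    """Edges of the sub-network induced by the sampled nodes."""
    return {(u, v) for u in nodes for v in adj[u] if v in nodes and u < v}
```

In a divide-and-conquer setting, a base algorithm would then be run on each induced sub-network and the per-sample results combined; that combination step depends on the task and is not sketched here.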

Problem

Research questions and friction points this paper is trying to address.

Compares graph sub-sampling algorithms for large networks.

Evaluates divide-and-conquer methods for community and CP structures.

Identifies optimal sub-sampling strategies for specific network tasks.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Compares seven graph sub-sampling algorithms within divide-and-conquer algorithms.

Focuses on community and core-periphery (CP) structures.

Evaluates performance on simulated and real-world data.
