🤖 AI Summary
This work addresses the challenge of efficiently approximating k-graphlet distributions in massive graphs, where classic sampling methods require loading the entire graph into main memory and therefore scale poorly. The authors propose a streaming sampling algorithm that, for any fixed c > 0, uses Õ(n^{1+c}) memory and makes only O(1/c) passes over the graph, improving on the O(log n) passes of the earlier streaming algorithm of Bourreau et al. (NeurIPS 2024). By the lower bound of Bourreau et al., its memory usage is optimal up to a factor of Õ(n^c). The sampler is then used to approximate k-graphlet distributions. In experiments on real-world and synthetic graphs, the algorithm is always at least as good as that of Bourreau et al. and outperforms it by orders of magnitude on mildly dense graphs.
📝 Abstract
In recent years, the problem of computing the frequencies of the induced $k$-vertex subgraphs of a graph, or \emph{$k$-graphlets}, has become central. One approach for this problem is to sample $k$-graphlets randomly. Classic algorithms for $k$-graphlet sampling require loading the entire graph into main memory, making them impractical for massive graphs. To bypass this limitation, Bourreau et al. (NeurIPS 2024) introduced a \emph{streaming} algorithm that through nontrivial techniques makes only $O(\log n)$ passes using $O(n \log n)$ memory. In this work we break their $O(\log n)$-pass bound by giving an algorithm that, for any fixed $c>0$, makes $O(1/c)$ passes using $\tilde O(n^{1+c})$ memory. As a consequence of their lower bound, our algorithm is optimal up to a factor of $\tilde{O}(n^c)$ in the memory usage. We use this sampling algorithm to obtain an efficient method of approximating $k$-graphlet distributions. Experiments on real-world and synthetic graphs show that our algorithm is always at least as good as the one of Bourreau et al., and outperforms it by orders of magnitude on mildly dense graphs.
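To make the sampling approach concrete, here is a minimal in-memory sketch of estimating a $k$-graphlet distribution from uniform samples. It is not the streaming algorithm of the paper or of Bourreau et al.: it keeps the whole graph in memory (the limitation the abstract describes) and uses plain rejection sampling. The function names, the adjacency-dictionary input, and the `max_tries` cutoff are all assumptions made for illustration.

```python
import itertools
import random
from collections import Counter


def is_connected(vertices, adj):
    """Check whether the subgraph induced by `vertices` is connected (DFS)."""
    allowed = set(vertices)
    seen = {vertices[0]}
    stack = [vertices[0]]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w in allowed and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(allowed)


def canonical_form(vertices, adj):
    """Isomorphism-invariant key: the smallest sorted edge list over all
    relabelings of the vertices with 0..k-1. Brute force over k! orderings,
    so only sensible for small k."""
    best = None
    for perm in itertools.permutations(range(len(vertices))):
        label = dict(zip(vertices, perm))
        edges = tuple(sorted(
            tuple(sorted((label[u], label[v])))
            for u, v in itertools.combinations(vertices, 2)
            if v in adj[u]
        ))
        if best is None or edges < best:
            best = edges
    return best


def estimate_graphlet_distribution(adj, k, num_samples, rng, max_tries=None):
    """Estimate k-graphlet frequencies by rejection sampling (hypothetical helper).

    Draw k distinct vertices uniformly at random and keep the draw only if the
    induced subgraph is connected; accepted draws are uniform over k-graphlets.
    `adj` maps each vertex to the set of its neighbours (undirected graph).
    """
    nodes = sorted(adj)
    if max_tries is None:
        max_tries = 1000 * num_samples  # assumed cutoff so sparse graphs cannot loop forever
    counts = Counter()
    accepted = 0
    for _ in range(max_tries):
        if accepted == num_samples:
            break
        candidate = rng.sample(nodes, k)
        if is_connected(candidate, adj):
            counts[canonical_form(candidate, adj)] += 1
            accepted += 1
    if accepted < num_samples:
        raise RuntimeError("too few connected samples; graph may be too sparse")
    return {shape: c / num_samples for shape, c in counts.items()}
```

On sparse graphs almost every draw is rejected, which is one reason practical samplers, including the streaming ones discussed above, use more refined techniques than this baseline.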