🤖 AI Summary
Traditional CPU-based triangle counting (TC) is limited by memory bandwidth and low data reuse, which restricts its scalability. This work presents the first TC implementation on UPMEM, a commercial Processing-in-Memory (PIM) platform. It proposes a PIM-aware design that combines *vertex coloring* with *multi-level sampling*: coloring avoids expensive inter-core communication, reservoir sampling keeps each core's edge set within its limited DRAM bank, a Misra-Gries frequency summary speeds up counting on graphs with high-degree nodes, and uniform edge sampling trades accuracy for faster approximate results. The system supports both exact and approximate TC modes, as well as dynamic graph processing. Experiments demonstrate speedups of several-fold to over an order of magnitude over state-of-the-art CPU implementations across diverse graph scales, significantly alleviating memory bandwidth pressure. To the authors' knowledge, this is the first TC system adapted to real-world PIM hardware, and it indicates that PIM is a viable target for memory-bound graph analytics.
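To make the coloring idea concrete, the following is a minimal Python sketch of one way vertex coloring can split a graph into independent partitions, so that each partition can be counted without talking to the others. It is an illustration under stated assumptions, not the paper's implementation: the `color_of` function, the one-partition-per-color-triplet layout, and the rule that a partition only counts triangles whose colors match its own triplet are all choices made for this example.

```python
from collections import defaultdict
from itertools import combinations_with_replacement


def color_of(v, num_colors):
    # Hypothetical coloring: any fixed function of the vertex id works here.
    return v % num_colors


def partition_edges(edges, num_colors):
    """Assign each edge to every color triplet that contains both endpoint colors."""
    parts = {t: [] for t in combinations_with_replacement(range(num_colors), 3)}
    for u, v in edges:
        a, b = color_of(u, num_colors), color_of(v, num_colors)
        for c in range(num_colors):
            parts[tuple(sorted((a, b, c)))].append((u, v))
    return parts


def count_local(edges, triplet, num_colors):
    """Count triangles in one partition whose vertex colors equal `triplet`."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    total = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                for w in adj[u] & adj[v]:
                    if v < w:
                        colors = tuple(sorted(color_of(x, num_colors) for x in (u, v, w)))
                        # Assumption of this sketch: skip triangles owned by another triplet.
                        if colors == triplet:
                            total += 1
    return total


def count_triangles_colored(edges, num_colors):
    """Sum the independent per-partition counts; no partition reads another's edges."""
    parts = partition_edges(edges, num_colors)
    return sum(count_local(e, t, num_colors) for t, e in parts.items())
```

In this sketch every edge is replicated into `num_colors` partitions, and each triangle is counted exactly once, in the partition whose triplet equals the sorted colors of its three vertices. For example, `count_triangles_colored([(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)], 3)` splits the complete graph on four vertices into ten partitions and sums their local counts.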
📝 Abstract
Triangle Counting (TC) is the task of counting the triangles in a graph. It has important applications in numerous fields, such as social or biological network analysis and network security. TC is a memory-bound workload that does not scale efficiently in conventional processor-centric systems because it performs many memory accesses spread across large memory regions and exhibits low data reuse. Recent Processing-in-Memory (PIM) architectures offer a promising way to alleviate these bottlenecks. Our work presents the first TC algorithm that leverages the capabilities of the UPMEM system, the first commercially available PIM architecture, while also addressing its limitations. We use a vertex coloring technique to avoid expensive communication between PIM cores and employ reservoir sampling to cope with the limited amount of memory available in the PIM cores' DRAM banks. In addition, our work uses the Misra-Gries summary to speed up triangle counting on graphs with high-degree nodes, and uniform sampling of the graph edges to obtain approximate results more quickly. Our PIM implementation surpasses state-of-the-art CPU-based TC implementations when processing dynamic graphs in Coordinate List (COO) format, showcasing the effectiveness of the UPMEM architecture in addressing TC's memory-bound challenges.
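The two summarization techniques named above can be sketched in a few lines of Python. This is a hedged illustration rather than the paper's code. It assumes an edge stream without duplicate edges, uses the standard reservoir algorithm (Algorithm R) to cap the number of stored edges, and rescales the sampled triangle count by the probability that all three edges of a triangle survive in the sample. The `misra_gries` helper is the textbook frequency summary applied to edge endpoints. The abstract does not say how the summary is then used to speed up counting, so the sketch stops at identifying candidate high-degree vertices. All function names are hypothetical.

```python
import random
from collections import defaultdict


def reservoir_sample(stream, capacity, rng):
    """Algorithm R: keep a uniform sample of at most `capacity` items."""
    sample, seen = [], 0
    for item in stream:
        seen += 1
        if len(sample) < capacity:
            sample.append(item)
        else:
            j = rng.randrange(seen)  # uniform in [0, seen)
            if j < capacity:
                sample[j] = item
    return sample, seen


def count_triangles(edges):
    """Exact count via neighbor-set intersection, each triangle counted once."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    total = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                total += sum(1 for w in adj[u] & adj[v] if w > v)
    return total


def estimate_triangles(stream, capacity, seed=0):
    """Estimate the triangle count from a bounded-memory edge sample."""
    if capacity < 3:
        raise ValueError("capacity must be at least 3")
    sample, seen = reservoir_sample(stream, capacity, random.Random(seed))
    local = count_triangles(sample)
    if seen <= capacity:
        return float(local)  # nothing was dropped, so the count is exact
    m, t = capacity, seen
    # Probability that three given edges are all in a uniform sample of size m.
    p = (m * (m - 1) * (m - 2)) / (t * (t - 1) * (t - 2))
    return local / p


def misra_gries(stream, k):
    """Keep at most k - 1 counters; any item with frequency > n / k survives."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

When the stream fits in the reservoir, `estimate_triangles` returns the exact count. Otherwise the result is an unbiased estimate whose variance grows as `capacity` shrinks relative to the stream length. Feeding both endpoints of every edge into `misra_gries` yields a small set of candidate high-degree vertices without storing a full degree table.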