On Optimizing Locality of Graph Transposition on Modern Architectures

📅 2025-01-12
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the high memory overhead and poor cache locality of Graph Transposition (GT) in shared-memory systems, this paper proposes PoTra, a lightweight, architecture-aware GT algorithm. PoTra combines the skewed degree distribution of real-world graphs with the characteristics of multi-level CPU caches, using a locality-driven, compressed auxiliary data structure sized to approximate L3 cache capacity. It also presents a performance model that links cache and memory response times to graph locality. By combining atomic operation optimizations, graph-structure-aware memory layout, and multicore parallel programming, PoTra achieves up to 8.7× speedup over previous work across three CPU architectures and 20 real-world and synthetic graphs (up to 128 billion edges); where it is slower, the performance loss is limited to 15.7% on average. The approach improves energy efficiency and scalability for large-scale GT.

πŸ“ Abstract
This paper investigates the shared-memory Graph Transposition (GT) problem, a fundamental graph algorithm that is widely used in graph analytics and scientific computing. Previous GT algorithms have significant memory requirements that are proportional to the number of vertices and threads, which obstructs their use on large graphs. Moreover, atomic memory operations have become comparably fast on recent CPU architectures, which creates new opportunities for improving the performance of concurrent atomic accesses in GT. We design PoTra, a GT algorithm which leverages graph structure and processor and memory architecture to optimize locality and performance. PoTra limits the size of additional data structures close to CPU cache sizes and utilizes the skewed degree distribution of graph datasets to optimize locality and performance. We present the performance model of PoTra to explain the connection between cache and memory response times and graph locality. Our evaluation of PoTra on three CPU architectures and 20 real-world and synthetic graph datasets with up to 128 billion edges demonstrates that PoTra achieves up to 8.7 times speedup compared to previous works and that, where there is a performance loss, it remains limited to 15.7% on average.
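For readers unfamiliar with the problem, the following is a minimal sequential sketch of graph transposition on a graph stored in compressed sparse row (CSR) form. It is a textbook counting-sort baseline, not PoTra, and the names `transpose_csr`, `offsets`, and `neighbors` are assumptions introduced here for illustration. It shows where the costs discussed in the abstract come from: the counting and scatter passes touch `in_degree`, `cursor`, and `t_neighbors` at positions determined by edge destinations, so the accesses are effectively random, and in a multithreaded version the two increments would typically become atomic fetch-and-add operations.

```python
def transpose_csr(offsets, neighbors):
    """Transpose a directed graph given in CSR form.

    offsets[v]..offsets[v+1] delimits the out-neighbors of vertex v
    inside `neighbors`. Returns (t_offsets, t_neighbors) in the same
    layout, where each vertex lists its in-neighbors instead.
    Illustrative baseline only; names are hypothetical.
    """
    n = len(offsets) - 1

    # Pass 1: count in-degrees. Indexed by destination, so accesses
    # are scattered; a parallel version would use atomic increments.
    in_degree = [0] * n
    for dst in neighbors:
        in_degree[dst] += 1

    # Exclusive prefix sum turns counts into offsets of the transpose.
    t_offsets = [0] * (n + 1)
    for v in range(n):
        t_offsets[v + 1] = t_offsets[v] + in_degree[v]

    # Pass 2: scatter each edge (src, dst) into dst's slot range.
    # cursor[dst] is the next free slot; again a fetch-and-add in a
    # parallel setting, followed by a scattered write.
    cursor = t_offsets[:n]  # slicing copies the list
    t_neighbors = [0] * len(neighbors)
    for src in range(n):
        for i in range(offsets[src], offsets[src + 1]):
            dst = neighbors[i]
            t_neighbors[cursor[dst]] = src
            cursor[dst] += 1

    return t_offsets, t_neighbors


# Example graph with edges 0->1, 0->2, 1->2, 2->0.
example_offsets = [0, 2, 3, 4]
example_neighbors = [1, 2, 2, 0]
t_off, t_nbr = transpose_csr(example_offsets, example_neighbors)
```

In this example vertex 2 has two incoming edges, so it owns the last two slots of `t_nbr`. Transposing the result again returns the original arrays, which is a convenient sanity check for any GT implementation.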
Problem

Research questions and friction points this paper is trying to address.

Graph Transposition
Memory Efficiency
Processing Speed
Innovation

Methods, ideas, or system contributions that make the work stand out.

PoTra
Graph Transposition
Memory Optimization