🤖 AI Summary
This work addresses irregular memory accesses, which lack the temporal or spatial address patterns that conventional prefetchers rely on; the temporal prefetchers that target them depend on address recurrence and incur high metadata storage cost. The paper proposes Instruction-Correlation Prefetching (ICP), a hardware prefetching mechanism that abandons assumptions about address correlation and instead exploits the stable data dependencies among the instructions that generate irregular accesses. By using the results of already-executed instructions to compute the addresses of future accesses, ICP prefetches effectively at low cost. It requires only 2.1 KB of storage, over three orders of magnitude less than temporal prefetchers, and on irregular SPEC CPU and GAP benchmarks it outperforms the state-of-the-art temporal prefetcher Triangel by 14.0% and the indirect prefetcher DMP by 6.0%.
📝 Abstract
Irregular memory accesses pose challenges for effective and efficient data prefetching. While temporal prefetchers have recently shown promise for irregular memory access patterns, their effectiveness fundamentally depends on temporal address recurrence and large metadata storage. When memory addresses exhibit weak or no recurrence, as in indirect memory accesses, temporal prefetchers achieve limited performance gains while incurring substantial storage overhead. This paper proposes Instruction-Correlation Prefetching (ICP), a new hardware prefetching mechanism that exploits instruction-level correlations rather than memory-address correlations to handle irregular memory accesses. ICP observes that although memory addresses may not repeat, the instructions generating them often recur with stable data-dependency relationships. By learning these persistent instruction correlations, ICP speculatively computes and prefetches future irregular accesses using the execution results of their correlated predecessors. Across irregular SPEC CPU and GAP benchmarks, ICP outperforms the state-of-the-art temporal prefetcher Triangel by 14.0% and the indirect prefetcher DMP by 6.0%, while requiring only 2.1 KB of hardware storage, over three orders of magnitude smaller than temporal prefetchers.
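The core observation can be illustrated with a toy software model of an indirect access such as `data[index[i]]`. The sketch below is not ICP's hardware design, which the abstract does not specify. It assumes, purely for illustration, a single producer load and a single consumer load whose address is a linear function of the producer's result (`base + scale * value`); the class name `ToyCorrelationLearner`, the function `run_demo`, and the constants are all hypothetical.

```python
import random


class ToyCorrelationLearner:
    """Toy model: learns addr = base + scale * value for one producer/consumer pair.

    The linear form is an assumption made for this illustration only.
    """

    def __init__(self):
        self.samples = []  # observed (producer_value, consumer_address) pairs
        self.scale = None
        self.base = None

    def observe(self, producer_value, consumer_address):
        """Record one executed producer/consumer pair and try to fit the relation."""
        if self.scale is not None:
            return  # relation already learned
        self.samples.append((producer_value, consumer_address))
        v0, a0 = self.samples[0]
        for v1, a1 in self.samples[1:]:
            # Two samples with distinct producer values determine scale and base.
            if v1 != v0 and (a1 - a0) % (v1 - v0) == 0:
                self.scale = (a1 - a0) // (v1 - v0)
                self.base = a0 - self.scale * v0
                return

    def predict(self, producer_value):
        """Return the predicted consumer address, or None if nothing is learned yet."""
        if self.scale is None:
            return None
        return self.base + self.scale * producer_value


def run_demo(seed=0, n=64):
    """Simulate data[index[i]] with distinct, randomly ordered index values."""
    rng = random.Random(seed)
    data_base, elem_size = 0x10000, 8  # hypothetical array base and element size
    index = rng.sample(range(1 << 20), n)  # distinct values: no address recurs
    learner = ToyCorrelationLearner()
    seen = set()
    repeats = predicted = correct = 0
    for value in index:  # 'value' is the result of the producer load
        addr = data_base + elem_size * value  # address of the consumer load
        guess = learner.predict(value)
        if guess is not None:
            predicted += 1
            correct += guess == addr
        repeats += addr in seen  # what an address-recurrence scheme could exploit
        seen.add(addr)
        learner.observe(value, addr)
    return repeats, predicted, correct
```

In this toy run no consumer address ever repeats, so a scheme keyed on address recurrence has nothing to learn from, while the relation between the two instructions is fitted from the first two pairs and then predicts every later address. The learned state is two integers, independent of how many distinct addresses are touched.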