🤖 AI Summary
This work addresses the growing performance gap between processors and memory, which limits system efficiency, particularly because existing prefetchers struggle under cache reference reordering and out-of-order execution. To overcome these limitations, the paper proposes a lookahead prefetching mechanism that integrates path prediction with region-based access-map patterns. By constructing access patterns through online learning and applying confidence-aware throttling to the speculation, the approach reduces sensitivity to reference reordering and enables prefetching across region boundaries, two weaknesses of current designs. Integrated into the L2 cache, the proposed mechanism improves system performance by 31.4% over a non-prefetching baseline and by 6.2% over the state-of-the-art prefetchers Berti and Pythia.
📝 Abstract
The discrepancy between processor speed and memory system performance continues to limit the performance of many workloads. One effective and well-studied technique for addressing this gap is cache prefetching. Many prefetching designs have been proposed, with varying approaches and effectiveness. For example, SPP is a popular prefetcher that leverages confidence-throttled recursion to speculate on the future path of a program's references; however, it is highly susceptible to the reference reordering introduced by higher-level caches and the out-of-order core. AMPM, another popular prefetcher, takes an orthogonal approach: it uses reordering-resistant access maps to identify patterns within a region, but it cannot speculate beyond that region. In this paper, we propose SPPAM, a new approach to prefetching inspired by prior works such as SPP and AMPM that addresses their limitations. SPPAM uses online learning to build a set of access-map patterns, which drive a speculative lookahead throttled by a confidence metric. Targeting the second-level cache, SPPAM alongside the state-of-the-art prefetchers Berti and Bingo improves system performance by 31.4% over no prefetching and by 6.2% over a baseline of Berti and Pythia.
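To make the core idea concrete, the sketch below illustrates a confidence-throttled lookahead over learned access-map patterns, in the spirit of what the abstract describes. All names (`PatternTable`, `CONF_THRESHOLD`, the 4-delta signature, `max_depth`) are illustrative assumptions, not details from the paper, and the structure of the real SPPAM tables will differ.

```python
REGION_SIZE = 64          # cache lines per region (assumption)
CONF_THRESHOLD = 0.25     # stop speculating below this confidence (assumption)

class PatternTable:
    """Maps an access-pattern signature to per-delta occurrence counts,
    trained online as demand accesses are observed."""
    def __init__(self):
        self.table = {}

    def train(self, signature, next_delta):
        counts = self.table.setdefault(signature, {})
        counts[next_delta] = counts.get(next_delta, 0) + 1

    def predict(self, signature):
        """Return the most likely next delta and its empirical confidence."""
        counts = self.table.get(signature)
        if not counts:
            return None, 0.0
        total = sum(counts.values())
        delta, hits = max(counts.items(), key=lambda kv: kv[1])
        return delta, hits / total

def lookahead_prefetch(table, signature, start_offset, max_depth=8):
    """Recursively follow predicted deltas, multiplying per-step confidences,
    until the cumulative confidence drops below the throttle threshold."""
    prefetches = []
    confidence = 1.0
    offset = start_offset
    for _ in range(max_depth):
        delta, conf = table.predict(signature)
        if delta is None:
            break
        confidence *= conf
        if confidence < CONF_THRESHOLD:
            break
        offset += delta
        if not (0 <= offset < REGION_SIZE):
            break  # a real design would chain into the adjacent region here
        prefetches.append(offset)
        signature = (signature + (delta,))[-4:]  # slide the signature window
    return prefetches
```

After training on a unit-stride pattern, the lookahead issues a run of sequential prefetches and stops at the region edge; in a mixed pattern, the multiplied confidence shrinks each step, so speculation depth naturally tracks prediction quality.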