🤖 AI Summary
Modern mobile workloads suffer from severe frontend stalls due to code bloat and long-period repetition; existing instruction prefetchers exhibit narrow coverage, poor timeliness, or excessive overhead. This paper proposes a hardware–software co-designed deep lookahead prefetching mechanism: (1) a novel dynamic profiling technique that skips loops and recursive calls to enable hundreds-of-instructions-ahead prediction; (2) hardware integration of an enhanced Return Address Stack (RAS) to support prefetching along deeply nested call-return paths; and (3) metadata residency in DRAM, reducing on-chip storage overhead to near zero. Evaluated on real mobile workloads, this approach reduces L2 instruction miss rates by 19.6% on average (up to 45%) and improves performance by 4.7% on average (up to 8%). Its gains are up to four times those of state-of-the-art hardware-only record-and-replay prefetchers, while requiring two orders of magnitude less on-chip storage.
📝 Abstract
Mobile workloads incur heavy frontend stalls due to increasingly large code footprints and long repeat cycles. Existing instruction-prefetching techniques suffer from low coverage, poor timeliness, or high cost. We present DEER, a SW/HW co-designed instruction prefetcher: profile analysis extracts metadata that allows the hardware to prefetch the most likely future instruction cachelines hundreds of instructions in advance. The profile analysis skips over loops and recursion to look deeper into the future, and the hardware uses a return-address stack to enable prefetching along the return paths of deep call stacks. The resulting metadata table is placed in DRAM and referenced by an in-hardware register; the deep lookahead leaves enough time to preload the metadata, so almost no on-chip metadata storage is needed. gem5 evaluation on real-world modern mobile workloads shows up to a 45% reduction in L2 instruction-miss rate (19.6% on average), yielding up to 8% speedup (4.7% on average). These gains are up to 4X larger than those of full-hardware record-and-replay prefetchers, while requiring two orders of magnitude less on-chip storage.
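The core idea of looking "hundreds of instructions ahead" by collapsing repetition can be sketched as follows. This is an illustrative reconstruction, not the paper's actual profiling algorithm: the cacheline-trace format, the `deep_lookahead_targets` helper, and the lookahead depth are all assumptions. The sketch counts each cacheline only the first time it appears, so loop iterations and recursion over the same code do not consume lookahead depth:

```python
# Illustrative sketch of deep-lookahead profiling that skips loop
# iterations and recursive re-execution of the same cachelines
# (assumed trace format; not the paper's exact algorithm).

def deep_lookahead_targets(trace, depth):
    """For each cacheline in a dynamic trace, find the cacheline
    `depth` *unique* lines ahead, counting a line only on its first
    appearance in the window (so repeats from loops/recursion are
    skipped over instead of exhausting the lookahead budget)."""
    targets = {}
    for i, line in enumerate(trace):
        seen = {line}
        ahead = 0
        for future in trace[i + 1:]:
            if future in seen:       # repeated line: loop/recursion, skip
                continue
            seen.add(future)
            ahead += 1
            if ahead == depth:       # reached the deep-lookahead target
                targets[line] = future
                break
    return targets

# Tiny example: a loop over cachelines B and C, then straight-line D, E.
trace = ["A", "B", "C", "B", "C", "B", "C", "D", "E"]
print(deep_lookahead_targets(trace, 3))  # → {'A': 'D', 'B': 'E', 'C': 'E'}
```

Because the B/C loop is collapsed, the entry for A points three unique lines ahead (to D), so a prefetch triggered at A can be in flight while the loop still executes, which is the timeliness benefit the abstract describes.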