🤖 AI Summary
Modern mobile workloads suffer from severe frontend stalls due to code bloat and long-period repetition; existing instruction prefetchers exhibit narrow coverage, poor timeliness, or excessive overhead. This paper proposes a hardware–software co-designed deep lookahead prefetching mechanism: (1) a novel dynamic profiling technique that skips loops and recursive calls to enable hundreds-of-instructions-ahead prediction; (2) hardware integration of an enhanced Return Address Stack (RAS) to support prefetching along deeply nested call-return paths; and (3) metadata residency in DRAM, reducing on-chip storage overhead to near zero. Evaluated on real mobile workloads, this approach reduces L2 instruction miss rates by 19.6% on average (up to 45%) and improves performance by 4.7% on average (up to 8%). Its gains are up to four times those of state-of-the-art hardware-only record-and-replay prefetchers, while requiring two orders of magnitude less on-chip storage.
📝 Abstract
Mobile workloads incur heavy frontend stalls due to increasingly large code footprints and long repeat cycles. Existing instruction-prefetching techniques suffer from low coverage, poor timeliness, or high cost. We present DEER, a SW/HW co-designed instruction prefetcher: profile analysis extracts metadata that allows the hardware to prefetch the most likely future instruction cachelines hundreds of instructions in advance. The profile analysis skips over loops and recursion to look deeper into the future, and the hardware uses a return-address stack to enable prefetching along the return paths of deep call stacks. The resulting metadata table is placed in DRAM and referenced by an in-hardware register; the deep lookahead leaves enough time to preload the metadata, so almost no on-chip metadata storage is needed. gem5 evaluation on real-world modern mobile workloads shows up to a 45% reduction in L2 instruction-miss rate (19.6% on average), yielding up to 8% speedup (4.7% on average). These gains are up to 4X larger than those of full-hardware record-and-replay prefetchers, while requiring two orders of magnitude less on-chip storage.
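The core idea of looking "hundreds of instructions ahead" by collapsing repetition can be sketched as follows. This is an illustrative reconstruction, not the paper's actual profiling algorithm: the cacheline-trace format, the `deep_lookahead_targets` helper, and the lookahead depth are all assumptions. The sketch counts each cacheline only the first time it appears, so loop iterations and recursion over the same code do not consume lookahead depth:

```python
# Illustrative sketch of deep-lookahead profiling that skips loop
# iterations and recursive re-execution of the same cachelines
# (assumed trace format; not the paper's exact algorithm).

def deep_lookahead_targets(trace, depth):
    """For each cacheline in a dynamic trace, find the cacheline
    `depth` *unique* lines ahead, counting a line only on its first
    appearance in the window (so repeats from loops/recursion are
    skipped over instead of exhausting the lookahead budget)."""
    targets = {}
    for i, line in enumerate(trace):
        seen = {line}
        ahead = 0
        for future in trace[i + 1:]:
            if future in seen:       # repeated line: loop/recursion, skip
                continue
            seen.add(future)
            ahead += 1
            if ahead == depth:       # reached the deep-lookahead target
                targets[line] = future
                break
    return targets

# Tiny example: a loop over cachelines B and C, then straight-line D, E.
trace = ["A", "B", "C", "B", "C", "B", "C", "D", "E"]
print(deep_lookahead_targets(trace, 3))  # → {'A': 'D', 'B': 'E', 'C': 'E'}
```

Because the B/C loop is collapsed, the entry for A points three unique lines ahead (to D), so a prefetch triggered at A can be in flight while the loop still executes, which is the timeliness benefit the abstract describes.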