🤖 AI Summary
Address translation remains a critical performance bottleneck in modern systems, primarily because virtual-to-physical address (VA→PA) mappings are unpredictable. Existing solutions either rely on large pages or contiguous mappings, whose availability cannot be guaranteed, or incur prohibitive hardware modification costs for marginal gains. This paper proposes a software–hardware co-designed predictive address translation mechanism: the OS uses hash-driven memory allocation to establish predictable VA→PA mappings, while the hardware integrates a lightweight speculation engine that concurrently prefetches data and last-level page table entries. Crucially, the approach does not depend on large pages or memory contiguity, substantially reducing TLB miss penalties. Across 11 data-intensive benchmarks, it achieves a 27% average speedup in native execution and 20% under virtualization, with a 9% improvement in energy efficiency. RTL validation confirms minimal hardware overhead.
📝 Abstract
Address translation is a major performance bottleneck in modern computing systems. Speculative address translation can hide translation latency by predicting the physical address (PA) of requested data early in the pipeline. However, predicting the PA from the virtual address (VA) is difficult because VA-to-PA mappings in conventional OSes are unpredictable. Prior works attempt to overcome this but face two key issues: (i) reliance on large pages or VA-to-PA contiguity, which is not guaranteed, and (ii) costly hardware changes to store speculation metadata, with limited effectiveness.
We introduce Revelator, a hardware-OS cooperative scheme enabling highly accurate speculative address translation with minimal modifications. Revelator employs a tiered hash-based allocation strategy in the OS to create predictable VA-to-PA mappings, falling back to conventional allocation when needed. On a TLB miss, a lightweight speculation engine, guided by this policy, generates candidate PAs for both program data and last-level page table entries (PTEs). Thus, Revelator (i) speculatively fetches requested data before translation resolves, reducing access latency, and (ii) fetches the fourth-level PTE before the third-level PTE is accessed, accelerating page table walks.
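The core idea above, namely hash the virtual page number to pick a candidate physical frame, fall back to conventional allocation on conflict, and let the hardware prefetch from the hashed candidate on a TLB miss, can be sketched as follows. This is a minimal illustration, not Revelator's actual design: the hash function, pool size, and all names here are assumptions made for the sketch.

```python
# Illustrative sketch of hash-based speculative translation.
# FRAME_POOL_SIZE, candidate_frame, and HashAllocator are hypothetical
# stand-ins, not Revelator's real OS allocator or speculation engine.

FRAME_POOL_SIZE = 1 << 20   # assumed number of frames in the hash-managed pool

def candidate_frame(vpn: int) -> int:
    """Hash a virtual page number to a candidate physical frame
    (Fibonacci-style multiplicative hash as a stand-in)."""
    return (vpn * 0x9E3779B1) % FRAME_POOL_SIZE

class HashAllocator:
    """OS side: try to place each page at its hashed frame so the
    mapping is predictable; fall back to conventional allocation."""
    def __init__(self):
        self.frames = {}      # occupied frame -> vpn
        self.mappings = {}    # vpn -> frame (stands in for the page table)

    def allocate(self, vpn: int) -> int:
        f = candidate_frame(vpn)
        if f in self.frames:
            # Hashed frame taken: fall back to the first free frame,
            # as conventional allocators would.
            f = next(i for i in range(FRAME_POOL_SIZE) if i not in self.frames)
        self.frames[f] = vpn
        self.mappings[vpn] = f
        return f

    def speculate(self, vpn: int) -> bool:
        """Hardware side: on a TLB miss, prefetch from the hashed
        candidate. True means the speculation would have been correct."""
        return self.mappings.get(vpn) == candidate_frame(vpn)
```

When the hashed frame is free at allocation time, a later TLB miss can be served speculatively from `candidate_frame(vpn)` before the page walk resolves; only pages that took the fallback path pay the full walk latency.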
We prototype Revelator's OS support in Linux and evaluate it in simulation across 11 diverse, data-intensive benchmarks in native and virtualized environments. Revelator achieves average speedups of 27% (20%) in native (virtualized) settings, surpasses a state-of-the-art speculative mechanism by 5%, and reduces energy use by 9% compared to the baseline. Our RTL prototype shows minimal area and power overheads on a modern CPU.