🤖 AI Summary
CGRAs suffer from extremely low utilization (below 1.5%) on irregular workloads, such as graph processing and irregular database operations, due to unpredictable memory access patterns. To address this bottleneck, this paper introduces a runahead execution mechanism and a dynamic cache reconfiguration technique tailored to CGRAs, relaxing the conventional scratchpad memory (SPM) assumption of full data residency. By co-optimizing the microarchitecture and the memory model, the approach matches the performance of a full-SPM system while using only 1.27% of its storage. Experiments show an average speedup of 3.04× (up to 6.91×), with dynamic cache reconfiguration contributing a further 6.02% improvement, significantly alleviating the memory bottleneck for CGRA execution under irregular memory access patterns.
📝 Abstract
Coarse-Grained Reconfigurable Arrays (CGRAs) are specialized accelerators commonly employed to boost performance in workloads with iterative structures. Existing research typically focuses on compiler or architecture optimizations that improve CGRA performance, energy efficiency, flexibility, and area utilization, under the idealistic assumption that kernels can access all data from Scratchpad Memory (SPM). However, certain complex workloads, particularly in fields such as graph analytics, irregular database operations, and specialized forms of high-performance computing (e.g., unstructured mesh simulations), exhibit irregular memory access patterns that hinder CGRA utilization, sometimes dropping it below 1.5% and making the CGRA memory-bound. To address this challenge, we conduct a thorough analysis of the underlying causes of this performance degradation, then propose a redesigned memory subsystem and a refined memory model. With both microarchitectural and theoretical optimizations, our solution effectively manages irregular memory accesses through a CGRA-specific runahead execution mechanism and cache reconfiguration techniques. Our results demonstrate performance comparable to the original SPM-only system while requiring only 1.27% of the storage size. The runahead execution mechanism achieves an average 3.04× speedup (up to 6.91×), with the cache reconfiguration technique providing an additional 6.02% improvement, significantly enhancing CGRA performance on irregular memory access patterns.
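To make the core idea concrete, the sketch below models the general principle behind runahead execution: a prefetch engine walks the irregular index stream ahead of the compute datapath so that demand accesses find their data already resident. This is a toy illustration of the generic technique, not the paper's microarchitecture; `ToyCache`, `prefetch_distance`, and the gather-style access pattern are all invented for this example.

```python
class ToyCache:
    """Tiny fully associative LRU cache model that counts demand hits/misses.
    (Illustrative only; not the cache organization proposed in the paper.)"""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = []   # LRU order: front = least recently used
        self.hits = 0
        self.misses = 0

    def access(self, addr, is_prefetch=False):
        if addr in self.lines:
            # Hit: refresh LRU position; only demand accesses are counted.
            self.lines.remove(addr)
            self.lines.append(addr)
            if not is_prefetch:
                self.hits += 1
            return
        if not is_prefetch:
            self.misses += 1
        if len(self.lines) >= self.capacity:
            self.lines.pop(0)  # evict least recently used line
        self.lines.append(addr)


def run_kernel(indices, cache, prefetch_distance=0):
    """Walk an irregular index stream. With prefetch_distance > 0, a runahead
    engine issues a prefetch for the address the kernel will demand
    `prefetch_distance` iterations from now, before each demand access."""
    for i, idx in enumerate(indices):
        ahead = i + prefetch_distance
        if prefetch_distance and ahead < len(indices):
            cache.access(indices[ahead], is_prefetch=True)
        cache.access(idx)  # demand access by the compute datapath


# Gather-like irregular pattern (hypothetical), too large for a 4-line cache.
pattern = [7, 2, 9, 2, 7, 5, 9, 1, 7, 2] * 3

baseline = ToyCache(capacity=4)
run_kernel(pattern, baseline)

runahead = ToyCache(capacity=4)
run_kernel(pattern, runahead, prefetch_distance=2)

print("baseline misses:", baseline.misses)
print("runahead misses:", runahead.misses)
```

In this toy model, running ahead converts a portion of the demand misses into hits because the line is installed before the datapath asks for it; in real hardware the corresponding benefit is hiding main-memory latency behind useful compute.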