🤖 AI Summary
This work addresses the challenge of efficiently finding optimal mapping strategies for deep neural network accelerators, where mappings critically impact energy efficiency and latency yet existing approaches fail to converge to optimality within practical timeframes. The paper proposes Turbo-Charged Mapper (TCM), which introduces the novel concept of “data placement” to expose structural redundancies in the mapping space. By integrating formal analysis with aggressive pruning techniques, TCM achieves a search space reduction of up to 10³²-fold. This enables exhaustive traversal of the pruned space within one minute while rigorously guaranteeing optimality. In contrast, state-of-the-art methods—even when allowed over ten hours of runtime—fail to converge and yield suboptimal mappings with 21% higher energy-delay product (EDP).
📝 Abstract
The energy and latency of an accelerator running a deep neural network (DNN) depend on how computation and data movement are scheduled on the accelerator (i.e., the mapping). Optimizing mappings is essential to evaluating and designing accelerators. However, the space of mappings is large, and prior works cannot guarantee finding optimal mappings because they use heuristics or metaheuristics to narrow down the space. These limitations preclude proper hardware evaluation, since designers cannot tell whether performance differences are due to changes in hardware or to suboptimal mappings.
To address this challenge, we propose the Turbo-Charged Mapper (TCM), a fast mapper that is guaranteed to find optimal mappings. The key to our approach is that we define a new concept in mapping, called data placement, which, like the prior concept of dataflow, allows for clear analysis and comparison of mappings. Through it, we identify multiple opportunities to prune redundant and suboptimal mappings, reducing the search space by up to 32 orders of magnitude.
Leveraging these insights, TCM can perform full mapspace searches, making it the first mapper that can find optimal mappings in feasible runtime. Compared to prior mappers, we show that TCM finds optimal mappings quickly (in less than a minute), while prior works cannot find optimal mappings (energy-delay product $21\%$ higher than optimal) even when given $1000\times$ the runtime ($>10$ hours).
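To make the pruning idea concrete, here is a minimal, hypothetical sketch (not TCM's actual algorithm or cost model): it exhaustively enumerates tiled mappings of a tiny $4\times4$ loop nest, then collapses mappings that are structurally redundant. The specific redundancy shown, that outer-loop order is irrelevant when a loop runs only once, is an illustrative example of the kind of equivalence a placement-style analysis can expose; the names `canonical` and `prune` are invented for this sketch.

```python
def divisors(n):
    """All positive divisors of n (candidate tile sizes)."""
    return [d for d in range(1, n + 1) if n % d == 0]

M, N = 4, 4  # toy problem dimensions

def enumerate_mappings():
    # A mapping here is (tile_m, tile_n, outer_loop_order).
    return [(tm, tn, order)
            for tm in divisors(M)
            for tn in divisors(N)
            for order in ("MN", "NM")]

def canonical(mapping):
    tm, tn, order = mapping
    # If either outer loop has a trip count of 1, the two loop orders
    # produce the exact same schedule and data movement, so we pick
    # one canonical representative.
    if M // tm == 1 or N // tn == 1:
        order = "MN"
    return (tm, tn, order)

def prune(mappings):
    # Keep one mapping per equivalence class.
    seen = {}
    for m in mappings:
        seen.setdefault(canonical(m), m)
    return list(seen.values())

full = enumerate_mappings()
pruned = prune(full)
print(len(full), len(pruned))  # → 18 13
```

Even in this toy space, a third of the mappings are duplicates of others; on realistic DNN layers with many dimensions, memory levels, and loop permutations, the same style of equivalence reasoning compounds multiplicatively, which is how reductions on the order of $10^{32}$ become possible while still allowing an exhaustive, optimality-preserving sweep of what remains.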