🤖 AI Summary
This work addresses the disjoint optimization of quantization and hardware mapping in CNN accelerators, where energy efficiency and memory constraints are tightly coupled. We propose a quantization–mapping co-design methodology that jointly optimizes weight quantization policies and hardware mappings (scheduling and resource allocation) under multiple objectives. We extend the Timeloop framework to support mixed-precision quantization modeling and introduce a layer-wise adaptive bit-width–mapping co-search algorithm. Evaluated on the Eyeriss and Simba architectures with MobileNetV1/V2, our approach achieves up to 37% energy reduction with no loss of ImageNet accuracy, significantly expanding the Pareto frontier across energy efficiency, accuracy, and memory usage. Our core contribution is the identification of a previously hidden, high-efficiency mapping space enabled by mixed-precision quantization and the development of the first open-source toolchain supporting quantization-aware joint mapping optimization.
📝 Abstract
The energy efficiency and memory footprint of a convolutional neural network (CNN) implemented on a CNN inference accelerator depend on many factors, including the weight quantization strategy (i.e., data types and bit-widths) and the mapping (i.e., placement and scheduling of DNN elementary operations on the hardware units of the accelerator). We show that enabling rich mixed quantization schemes during the implementation can open a previously hidden space of mappings that utilize the hardware resources more effectively. CNNs utilizing quantized weights and activations together with suitable mappings can significantly improve the trade-offs among accuracy, energy, and memory requirements compared to less carefully optimized CNN implementations. To find, analyze, and exploit these mappings, we: (i) extend a general-purpose state-of-the-art mapping tool (Timeloop) to support mixed quantization, a capability it currently lacks; (ii) propose an efficient multi-objective optimization algorithm to find the most suitable bit-widths and mapping for each DNN layer executed on the accelerator; and (iii) conduct a detailed experimental evaluation to validate the proposed method. On two CNNs (MobileNetV1 and MobileNetV2) and two accelerators (Eyeriss and Simba), we show that for a given quality metric (such as accuracy on ImageNet), energy savings reach up to 37% without any accuracy drop.
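The multi-objective selection at the heart of such a co-search can be illustrated with a Pareto-dominance filter: among candidate (bit-width, mapping) configurations for a layer, only non-dominated ones with respect to energy, memory, and an accuracy proxy are kept. The sketch below is purely illustrative; the `Config` fields, mapping identifiers, and numbers are hypothetical and do not come from the paper's toolchain.

```python
# Minimal sketch of Pareto filtering over per-layer (bit-width, mapping)
# candidates. All names and values are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    bits: int        # weight bit-width for this layer
    mapping: str     # identifier of a Timeloop-style mapping
    energy: float    # estimated energy (lower is better)
    memory: float    # buffer footprint (lower is better)
    error: float     # accuracy-loss proxy (lower is better)

def dominates(a: Config, b: Config) -> bool:
    """a dominates b if it is no worse in every objective and
    strictly better in at least one."""
    no_worse = (a.energy <= b.energy and a.memory <= b.memory
                and a.error <= b.error)
    strictly_better = (a.energy < b.energy or a.memory < b.memory
                       or a.error < b.error)
    return no_worse and strictly_better

def pareto_front(candidates):
    """Keep only configurations not dominated by any other candidate."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o is not c)]

candidates = [
    Config(8, "m0", energy=10.0, memory=64.0, error=0.00),
    Config(4, "m1", energy=6.0,  memory=32.0, error=0.01),
    Config(4, "m2", energy=7.0,  memory=40.0, error=0.01),  # dominated by m1
    Config(2, "m3", energy=4.0,  memory=16.0, error=0.05),
]
front = pareto_front(candidates)  # m2 is filtered out
```

A full co-search would generate such candidates per layer (e.g., from a mapper like Timeloop extended with mixed-precision cost models) and explore bit-width assignments across layers under a global accuracy constraint; the filter above is only the selection step.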