AI Summary
Hardware accelerator design space exploration (DSE) faces challenges of enormous scale (O(10^17) design points), non-convexity, many-to-one mappings, and non-differentiability. To address these, this work pioneers the application of diffusion models to hardware design generation, proposing a condition-driven 1-D image synthesis framework that formulates architecture generation as learning the inverse performance mapping. Unlike gradient-based methods, the approach avoids initialization sensitivity and differentiability assumptions, enabling large-scale unstructured search. By integrating structured sampling with conditional generative networks, it achieves end-to-end, performance-guided architecture synthesis. Experiments demonstrate a 1312x speedup in search time, a 30% reduction in generation error, a 9.8% improvement in energy-delay product (EDP), and LLM inference energy efficiency 7.75x higher than DOSA.
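To make the "1-D image" framing concrete, the sketch below (PyTorch) shows how an accelerator configuration might be flattened into a 1-D vector, with a target-performance vector as the condition. All parameter names and values here are illustrative assumptions, not taken from the paper.

```python
import torch

# Hypothetical accelerator parameters; the paper's actual feature set may differ.
design = {
    "pe_rows": 16, "pe_cols": 16,        # processing-element array shape
    "glb_kib": 512, "spad_bytes": 256,   # on-chip buffer sizes
    "dram_bw_gbps": 64,                  # off-chip memory bandwidth
}

# Flatten the configuration into a 1-D vector -- the "1-D image" a conditional
# generative model would learn to synthesize (in practice each feature is
# normalized over the training dataset, not within a single sample).
x0 = torch.tensor([float(v) for v in design.values()])

# Target performance metrics act as the condition, e.g. (latency s, energy J).
condition = torch.tensor([1.2e-3, 0.35])
```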
Abstract
Design space exploration (DSE) is critical for developing optimized hardware architectures, especially for AI workloads such as deep neural networks (DNNs) and large language models (LLMs), which require specialized acceleration. As model complexity grows, accelerator design spaces have expanded to O(10^17) design points, becoming highly irregular, non-convex, and exhibiting many-to-one mappings from design configurations to performance metrics. This complexity renders direct inverse derivation infeasible and necessitates heuristic or sampling-based optimization. Conventional methods, including Bayesian optimization, gradient descent, reinforcement learning, and genetic algorithms, depend on iterative sampling, resulting in long runtimes and sensitivity to initialization. Deep learning-based approaches have reframed DSE as classification using recommendation models, but remain limited to small-scale (O(10^3)), less complex design spaces. To overcome these constraints, we propose a generative approach that models hardware design as 1-D image synthesis conditioned on target performance, enabling efficient learning of non-differentiable, non-bijective hardware-performance mappings. Our framework achieves 0.86% lower generation error than Bayesian optimization with a 17000x speedup, and outperforms GANDSE with 30% lower error at only 1.83x the search time. We further extend the method to a structured DSE setting, attaining 9.8% lower energy-delay product (EDP) and 6% higher performance, with up to 145.6x and 1312x faster search compared to existing optimization methods on O(10^17) design spaces. For LLM inference, our method achieves 3.37x and 7.75x lower EDP on a 32nm ASIC and a Xilinx UltraScale+ VPU13 FPGA, respectively, compared to the state-of-the-art DOSA framework.
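The conditional-generation loop the abstract describes can be sketched compactly. Below is a minimal, illustrative DDPM-style implementation in PyTorch, not the paper's actual model: the network size, noise schedule, design dimensionality, and condition variables are all assumptions.

```python
# Minimal sketch of conditional 1-D diffusion for design generation.
# Everything dimensional here (DESIGN_DIM, COND_DIM, T, hidden sizes) is illustrative.
import torch
import torch.nn as nn

DESIGN_DIM = 16   # hypothetical: number of accelerator parameters
COND_DIM = 2      # hypothetical: target metrics, e.g. (latency, energy)
T = 1000          # diffusion timesteps

class CondDenoiser(nn.Module):
    """Predicts the noise added to a 1-D design vector, given the timestep
    and a target-performance condition vector."""
    def __init__(self, hidden=256):
        super().__init__()
        self.time_emb = nn.Embedding(T, hidden)
        self.cond_emb = nn.Linear(COND_DIM, hidden)
        self.net = nn.Sequential(
            nn.Linear(DESIGN_DIM + 2 * hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, DESIGN_DIM),
        )

    def forward(self, x_t, t, cond):
        h = torch.cat([x_t, self.time_emb(t), self.cond_emb(cond)], dim=-1)
        return self.net(h)

# Standard DDPM linear noise schedule (chosen here for brevity).
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def train_step(model, opt, x0, cond):
    """One denoising step: corrupt designs at a random timestep and train the
    model to recover the noise, conditioned on the performance target."""
    t = torch.randint(0, T, (x0.size(0),))
    noise = torch.randn_like(x0)
    ab = alphas_bar[t].unsqueeze(-1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * noise
    loss = ((model(x_t, t, cond) - noise) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

@torch.no_grad()
def sample(model, cond, n):
    """Ancestral sampling: start from Gaussian noise and iteratively denoise
    toward design vectors matching the target `cond` (shape (1, COND_DIM))."""
    x = torch.randn(n, DESIGN_DIM)
    for i in reversed(range(T)):
        t = torch.full((n,), i, dtype=torch.long)
        eps = model(x, t, cond.expand(n, -1))
        alpha, ab = 1.0 - betas[i], alphas_bar[i]
        x = (x - betas[i] / (1 - ab).sqrt() * eps) / alpha.sqrt()
        if i > 0:
            x = x + betas[i].sqrt() * torch.randn_like(x)
    return x  # continuous vectors; decode/round to legal design points afterwards
```

Since the sampled vectors are continuous, they must be decoded back to legal design points; and because the hardware-performance mapping is many-to-one, drawing several samples per target and ranking them with a downstream cost model is a natural way to use such a generator.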