🤖 AI Summary
Current TPU design faces bottlenecks including heavy reliance on expert knowledge, high manual labor costs, and scarcity of domain-specific training data. To address these challenges, this paper proposes the first LLM-based automated TPU generation framework, specifically targeting systolic array architectures and enabling end-to-end hardware generation, from high-level specifications to synthesizable RTL. Key contributions are: (1) the first open-source, high-quality hardware-specific dataset for training and evaluation; (2) a hardware-semantic-aware RAG mechanism that substantially mitigates LLM hallucination; and (3) integrated modeling of systolic arrays, optimized approximate multiply-accumulate units, and automated hardware pipelining. Experimental results demonstrate that the generated TPUs achieve, on average, a 92% reduction in area and a 96% reduction in power consumption compared to manually optimized baselines, with significant improvements in energy efficiency and overall PPA (performance, power, area).
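To make the systolic-array target concrete, the sketch below simulates the dataflow of an output-stationary N×N systolic array performing a matrix multiply: operands are skewed in time, and each processing element (i, j) executes one multiply-accumulate per cycle to produce C[i][j]. This is a generic illustration of the architecture class the framework targets, not code from the paper itself.

```python
def systolic_matmul(A, B):
    """Cycle-level sketch of an N x N output-stationary systolic array.

    Rows of A stream in from the left and columns of B from the top,
    each skewed by one cycle per hop; PE (i, j) accumulates C[i][j].
    Illustrative model only, not TPU-Gen's generated RTL.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    # 3n - 2 cycles for all skewed operands to traverse the array.
    for t in range(3 * n - 2):
        for i in range(n):
            for j in range(n):
                k = t - i - j  # operand index reaching PE (i, j) at cycle t
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]  # one MAC per PE per cycle
    return C
```

The timing offset `t - i - j` captures why systolic arrays are attractive for TPUs: each PE only ever talks to its neighbors, so the wiring and control stay simple and regular.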
📝 Abstract
The increasing complexity and scale of Deep Neural Networks (DNNs) necessitate specialized tensor accelerators, such as Tensor Processing Units (TPUs), to meet various computational and energy efficiency requirements. Nevertheless, designing an optimal TPU remains challenging due to the high level of domain expertise required, considerable manual design time, and lack of high-quality, domain-specific datasets. This paper introduces TPU-Gen, the first Large Language Model (LLM) based framework designed to automate the exact and approximate TPU generation process, focusing on systolic array architectures. TPU-Gen is supported by a meticulously curated, comprehensive, and open-source dataset that covers a wide range of spatial array designs and approximate multiply-and-accumulate units, enabling design reuse, adaptation, and customization for different DNN workloads. The proposed framework leverages Retrieval-Augmented Generation (RAG) as an effective solution for building LLMs in a data-scarce hardware domain, addressing the most pressing issue: hallucination. TPU-Gen transforms high-level architectural specifications into optimized low-level implementations through an effective hardware generation pipeline. Our extensive experimental evaluations demonstrate superior performance, power, and area efficiency, with average reductions in area and power of 92% and 96%, respectively, relative to manually optimized reference designs. These results set new standards for driving advancements in next-generation design automation tools powered by LLMs.
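The approximate multiply-and-accumulate units mentioned above trade a small amount of arithmetic accuracy for area and power savings. One common family of such designs truncates operand LSBs before multiplying, which shrinks the partial-product array. The sketch below illustrates that general idea under stated assumptions; the truncation scheme and `trunc_bits` parameter are hypothetical and not the specific approximate units in TPU-Gen's dataset.

```python
def approx_mac(acc, a, b, trunc_bits=2):
    """Illustrative approximate MAC via operand LSB truncation.

    Zeroing the low `trunc_bits` bits of each non-negative operand
    before multiplying reduces the partial-product logic in hardware,
    at the cost of a bounded underestimate of the true product.
    Hypothetical sketch, not the paper's exact approximate unit.
    """
    mask = ~((1 << trunc_bits) - 1)  # clear the low trunc_bits bits
    return acc + (a & mask) * (b & mask)
```

With `trunc_bits=0` the unit degenerates to an exact MAC, which is why exact and approximate generation can share one parameterized template.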