CodeV: Empowering LLMs with HDL Generation through Multi-Level Summarization

📅 2024-07-15
📈 Citations: 22
✨ Influential: 9
🤖 AI Summary
Existing HDL generation models suffer from low-quality training data and limited support for diverse languages and tasks. To address these limitations, this paper proposes a methodology for building large language models (LLMs) tailored to processor design: starting from authentic, high-quality Verilog and Chisel code, it devises a multi-level summarization data synthesis strategy and introduces a "Chat-FIM-Tag" three-stage joint supervised fine-tuning paradigm, augmented with explicit language-tag injection and few-shot label-guided learning. The resulting model, CodeV-All, is the first open-source HDL-specific LLM that supports both Verilog and Chisel and handles both conversational (Chat) and fill-in-the-middle (FIM) generation tasks. On the VerilogEval benchmark, CodeV-All matches or surpasses monolingual fine-tuned baselines, marking the first demonstration of unified, efficient multi-language, multi-task HDL code generation.

📝 Abstract
The design flow of processors, particularly in hardware description languages (HDL) like Verilog and Chisel, is complex and costly. While recent advances in large language models (LLMs) have significantly improved coding tasks in software languages such as Python, their application in HDL generation remains limited due to the scarcity of high-quality HDL data. Traditional methods of adapting LLMs for hardware design rely on synthetic HDL datasets, which often suffer from low quality because even advanced LLMs like GPT perform poorly in the HDL domain. Moreover, these methods focus solely on chat tasks and the Verilog language, limiting their application scenarios. In this paper, we observe that: (1) HDL code collected from the real world is of higher quality than code generated by LLMs. (2) LLMs like GPT-3.5 excel at summarizing HDL code rather than generating it. (3) An explicit language tag can help LLMs better adapt to the target language when there is insufficient data. Based on these observations, we propose an efficient LLM fine-tuning pipeline for HDL generation that integrates a multi-level summarization data synthesis process with a novel Chat-FIM-Tag supervised fine-tuning method. The pipeline enhances the generation of HDL code from natural language descriptions and enables the handling of various tasks such as chat and infilling incomplete code. Utilizing this pipeline, we introduce CodeV, a series of HDL generation LLMs. Among them, CodeV-All not only possesses a more diverse range of language abilities, i.e., Verilog and Chisel, and a broader scope of tasks, i.e., Chat and fill-in-the-middle (FIM), but also achieves performance on VerilogEval that is comparable to or even surpasses that of CodeV-Verilog fine-tuned on Verilog only, making them the first series of open-source LLMs designed for multi-scenario HDL generation.
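The data-synthesis side of the pipeline described above can be sketched as follows. This is a minimal illustration of the multi-level summarization idea only: the `summarize()` stub stands in for an actual LLM call (the paper's observation is that models like GPT-3.5 summarize HDL better than they generate it), and all function names and the abstraction-level labels are hypothetical, not the paper's implementation.

```python
def summarize(code: str, level: str) -> str:
    # Placeholder for an LLM summarization call at a given abstraction
    # level; a real pipeline would prompt a model such as GPT-3.5 here.
    return f"[{level} summary of a {len(code.splitlines())}-line module]"

def build_training_pairs(hdl_modules, levels=("statement-level", "high-level")):
    """Turn real-world HDL modules into (description, code) training
    pairs by summarizing each module at several abstraction levels."""
    pairs = []
    for code in hdl_modules:
        for level in levels:
            # The natural-language summary becomes the instruction; the
            # authentic, human-written code becomes the target output.
            pairs.append({"instruction": summarize(code, level), "output": code})
    return pairs

and_gate = (
    "module and_gate(input a, input b, output y);\n"
    "  assign y = a & b;\n"
    "endmodule"
)
pairs = build_training_pairs([and_gate])
print(len(pairs))  # one pair per abstraction level -> 2
```

The key property this sketch captures is direction of data flow: the code is collected, not generated, so the model is always trained toward authentic HDL.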
Problem

Research questions and friction points this paper is trying to address.

Limited application of LLMs to HDL generation due to the scarcity of high-quality HDL data
Low quality of synthetic HDL training datasets compared with real-world code
Limited support in existing methods for multi-language, multi-task HDL generation (Verilog and chat tasks only)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-level summarization turns real-world HDL code into high-quality synthetic training pairs
Chat-FIM-Tag supervised fine-tuning jointly handles chat, infilling, and language tagging
CodeV supports both Verilog and Chisel across Chat and FIM tasks
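The Chat-FIM-Tag combination can be illustrated with a small formatting sketch. The special tokens below (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`) follow a common fill-in-the-middle convention from open code LLMs, and the `[Verilog]`/`[Chisel]` tag format is an assumption for illustration; the exact tokens and tag syntax CodeV uses are not specified here.

```python
def make_chat_sample(instruction: str, code: str, lang: str) -> str:
    # Conversational (Chat) sample; the leading language tag tells the
    # model which HDL to emit when data for that language is scarce.
    return f"[{lang}] Instruction: {instruction}\nResponse:\n{code}"

def make_fim_sample(prefix: str, middle: str, suffix: str, lang: str) -> str:
    # Fill-in-the-middle (FIM) sample: the model sees the prefix and
    # suffix and is trained to produce the missing middle.
    return f"[{lang}]<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

chat = make_chat_sample(
    "Implement a 2-input AND gate.",
    "module and_gate(input a, input b, output y);\n"
    "  assign y = a & b;\nendmodule",
    lang="Verilog",
)
fim = make_fim_sample(
    prefix="module and_gate(input a, input b, output y);\n",
    middle="  assign y = a & b;\n",
    suffix="endmodule",
    lang="Verilog",
)
```

Mixing both sample types (plus the tag) in one fine-tuning run is what lets a single model serve conversational generation and code infilling across languages.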
Authors

Yang Zhao (SKL of Processors, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences)
Di Huang (SKL of Processors, Institute of Computing Technology, CAS)
Chongxiao Li (ICT, CAS)
Pengwei Jin (SKL of Processors, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences)
Ziyuan Nan (SKL of Processors, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences)
Tianyun Ma (SKL of Processors, Institute of Computing Technology, CAS; University of Science and Technology of China)
Lei Qi (SKL of Processors, Institute of Computing Technology, CAS; University of Science and Technology of China)
Yansong Pan (SKL of Processors, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences)
Zhenxing Zhang (School of Computing, Dublin City University)
Rui Zhang (SKL of Processors, Institute of Computing Technology, CAS)
Xishan Zhang (Institute of Computing Technology, Chinese Academy of Sciences)
Zidong Du (SKL of Processors, Institute of Computing Technology, CAS)
Qi Guo (SKL of Processors, Institute of Computing Technology, CAS)
Xing Hu (SKL of Processors, Institute of Computing Technology, CAS)
Yunji Chen (Institute of Computing Technology, Chinese Academy of Sciences)