🤖 AI Summary
Existing LLM-based RTL generation methods struggle to simultaneously ensure functional correctness and hardware quality (PPA). Supervised fine-tuning often yields functionally correct but PPA-suboptimal code, while post-processing techniques suffer from low efficiency because they cannot update model parameters. This paper proposes a hierarchical reward-driven reinforcement learning framework that unifies syntactic validity, functional correctness, and PPA metrics into multi-level reward signals. By tightly coupling RTL simulators and synthesis tools, the framework establishes a closed-loop feedback system, enabling the LLM to autonomously learn hardware design trade-offs during training. Experiments demonstrate state-of-the-art functional correctness on both the VerilogEval and RTLLM benchmarks. Notably, on RTLLM, our method achieves superior PPA over human-designed implementations in 27 out of 40 cases, marking the first instance where LLM-generated RTL surpasses human designers in both functional correctness and PPA.
📄 Abstract
Large Language Models (LLMs) show significant potential for automating Register-Transfer Level (RTL) code generation. However, current approaches face a critical challenge: they cannot simultaneously optimize for functional correctness and hardware quality (Power, Performance, and Area, i.e., PPA). Methods based on supervised fine-tuning often generate functionally correct but PPA-suboptimal code, lacking mechanisms to learn optimization principles. In contrast, post-processing techniques that attempt to improve PPA metrics after generation are often inefficient because they operate externally without updating the LLM's parameters, and thus fail to enhance the model's intrinsic design capabilities.
To bridge this gap, we introduce ChipSeek-R1, a hierarchical reward-driven reinforcement learning framework that trains LLMs to generate RTL code achieving both functional correctness and optimized PPA metrics. ChipSeek-R1 employs a hierarchical reward system that incorporates direct feedback on syntax, functional correctness (from simulators), and PPA metrics (from synthesis tools) during reinforcement learning. This enables the model to learn complex hardware design trade-offs via trial and error, generating RTL code that is both functionally correct and PPA-optimized. Evaluating ChipSeek-R1 on standard benchmarks (VerilogEval, RTLLM), we achieve state-of-the-art results in functional correctness. Notably, on the RTLLM benchmark, ChipSeek-R1 generated 27 RTL designs surpassing the PPA metrics of the original human-written code. Our findings demonstrate the effectiveness of integrating toolchain feedback into LLM training and highlight the potential for reinforcement learning to enable automated generation of human-surpassing RTL code. We open-source our code in an anonymous GitHub repository.
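To make the hierarchical reward concrete, the sketch below shows one plausible way such a multi-level signal could be structured: syntax gates function, function gates PPA, and PPA is scored against a baseline. This is an illustrative sketch only; the function names `compiles`, `passes_testbench`, and `synthesize` are hypothetical stand-ins for real linter, simulator, and synthesis-tool calls, and the exact reward values and PPA scoring are assumptions, not the paper's actual formulation.

```python
def compiles(rtl: str) -> bool:
    """Stub for a syntax check (a real system would invoke a Verilog linter)."""
    s = rtl.strip()
    return s.startswith("module") and s.endswith("endmodule")

def passes_testbench(rtl: str) -> bool:
    """Stub for simulator-based functional verification."""
    return "assign" in rtl

def synthesize(rtl: str) -> tuple:
    """Stub for a synthesis tool reporting (power, delay, area)."""
    return (0.8, 0.9, 1.1)

def hierarchical_reward(rtl: str, baseline_ppa=(1.0, 1.0, 1.0)) -> float:
    # Level 1: syntactic validity gates everything else.
    if not compiles(rtl):
        return -1.0
    # Level 2: functional correctness, judged by simulation.
    if not passes_testbench(rtl):
        return 0.0
    # Level 3: PPA quality relative to a (e.g., human-written) baseline;
    # ratios > 1 mean the generated design beats the baseline on that metric.
    ppa = synthesize(rtl)
    gain = sum(b / max(v, 1e-9) for v, b in zip(ppa, baseline_ppa)) / len(ppa)
    return 1.0 + gain
```

Under this shaping, any syntactically valid design dominates an invalid one, and any functionally correct design dominates a merely compilable one, so the policy can first learn correctness and then trade off PPA within the correct region.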