🤖 AI Summary
To address the dual bottlenecks of scarce high-quality training data and limited reasoning capability in RTL code generation, this paper introduces the first reasoning-oriented large language model specifically designed for hardware description languages. Methodologically: (1) we construct the first long-chain reasoning dataset for the RTL domain, comprising 3.5 billion tokens with an average trace length of 56K tokens; (2) we propose a test-time iterative self-reflection and self-correction mechanism, combining chain-of-thought (CoT) data construction with test-time scaling to strengthen inference. Our key contribution is the systematic introduction of long-chain reasoning to RTL code generation, achieving state-of-the-art performance by outperforming 18 strong baselines by up to 18.4% on VerilogEval and 12.7% on RTLLM.
📝 Abstract
Recent advances in large language models (LLMs) have enabled near-human performance on software coding benchmarks, but their effectiveness in RTL code generation remains limited due to the scarcity of high-quality training data. While prior efforts have fine-tuned LLMs for RTL tasks, they do not fundamentally overcome the data bottleneck and lack support for test-time scaling due to their non-reasoning nature. In this work, we introduce ScaleRTL, the first reasoning LLM for RTL coding that scales up both high-quality reasoning data and test-time compute. Specifically, we curate a diverse set of long chain-of-thought reasoning traces averaging 56K tokens each, resulting in a dataset of 3.5B tokens that captures rich RTL knowledge. Fine-tuning a general-purpose reasoning model on this corpus yields ScaleRTL, which is capable of deep RTL reasoning. We then further enhance the performance of ScaleRTL through a novel test-time scaling strategy that extends the reasoning process by iteratively reflecting on and self-correcting previous reasoning steps. Experimental results show that ScaleRTL achieves state-of-the-art performance on VerilogEval and RTLLM, outperforming 18 competitive baselines by up to 18.4% on VerilogEval and 12.7% on RTLLM.
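The iterative reflect-and-correct test-time scaling strategy described in the abstract can be sketched as a simple control loop. This is a minimal illustrative sketch, not the paper's actual implementation: the `generate` and `check` functions below are toy stand-ins (the real system would query the ScaleRTL model and verify its Verilog output, e.g. with a testbench), and the reflection cue appended to the prompt is a hypothetical example.

```python
# Hypothetical sketch of iterative self-reflection and self-correction at test
# time. All names here are illustrative; they are NOT the paper's API.

def generate(prompt: str, attempt: int) -> str:
    """Toy stand-in for the reasoning LLM: succeeds on the third attempt."""
    return "correct RTL" if attempt >= 3 else "buggy RTL"

def check(code: str) -> bool:
    """Toy stand-in for a functional check (e.g., simulating a testbench)."""
    return code == "correct RTL"

def reflect_and_correct(prompt: str, max_rounds: int = 5) -> tuple[str, int]:
    """Extend reasoning at test time: on failure, append a reflection cue
    to the prompt so the next pass revisits the previous reasoning steps."""
    draft = generate(prompt, attempt=1)
    for round_num in range(1, max_rounds + 1):
        if check(draft):
            return draft, round_num  # accepted draft and rounds used
        prompt += "\nWait, let me re-examine the previous reasoning step."
        draft = generate(prompt, attempt=round_num + 1)
    return draft, max_rounds  # best effort after the compute budget is spent

code, rounds = reflect_and_correct("Write an RTL counter module.")
```

The key design point this sketch captures is that more test-time compute (more rounds) buys more chances to reflect on and repair earlier reasoning, rather than sampling independent drafts from scratch.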