🤖 AI Summary
This study investigates the impact of parallel scaling on large language models' (LLMs) performance in automated Verilog code generation. To address the limited functional correctness of LLM-generated hardware designs, attributed to output randomness and the bias of single-sample decoding, we propose a lightweight parallel sampling method, requiring no fine-tuning or post-training, that generates many candidate solutions concurrently. We systematically evaluate this approach across mainstream open-weight LLMs (e.g., Llama-3, Qwen2) and established benchmarks (VerilogEval, HDLBits), and find a strong positive correlation between parallel scale (up to hundreds of samples) and generation quality. Crucially, we identify and empirically validate the mechanism by which statistical aggregation over parallel samples suppresses stochasticity and improves functional correctness. Experiments show that the method achieves substantial gains at controllable latency and computational cost, outperforming state-of-the-art LLM-based Verilog generators by up to 27.4% absolute in functional correctness.
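To make the sampling scheme concrete, here is a minimal sketch of parallel candidate generation followed by a functional-correctness filter. It is not the authors' pipeline: `sample_verilog`, `passes_testbench`, the model behavior, and the thread-pool sizing are all illustrative placeholders (in practice the first would call an LLM API with nonzero temperature and the second would run a simulator such as Icarus Verilog against a golden testbench).

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Placeholder: in practice this calls an LLM API with temperature > 0,
# so each call returns a different stochastic sample.
def sample_verilog(prompt: str, temperature: float = 0.8) -> str:
    return f"module top(...); /* candidate for: {prompt} */ endmodule"

# Placeholder: in practice this simulates the candidate against the
# benchmark's golden testbench and returns pass/fail.
def passes_testbench(candidate: str) -> bool:
    return random.random() < 0.3  # stand-in for a real functional check

def parallel_generate(prompt: str, n_samples: int = 256) -> str | None:
    """Draw n_samples candidates concurrently; return the first that passes."""
    with ThreadPoolExecutor(max_workers=32) as pool:
        candidates = list(pool.map(lambda _: sample_verilog(prompt),
                                   range(n_samples)))
    for candidate in candidates:
        if passes_testbench(candidate):
            return candidate
    return None  # no functionally correct sample at this budget

print(parallel_generate("4-bit ripple-carry adder") is not None)
```

Because the per-sample requests are independent, wall-clock latency stays close to that of a single sample, which is what makes scaling to hundreds of candidates cheap in time.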
📝 Abstract
We present VerilogMonkey, an empirical study of parallel scaling for the under-explored task of automated Verilog generation. Parallel scaling improves LLM performance by sampling many outputs in parallel. Across multiple benchmarks and mainstream LLMs, we find that scaling to hundreds of samples is cost-effective in both time and money and, even without any additional enhancements such as post-training or agentic methods, surpasses prior results on LLM-based Verilog generation. We further dissect why parallel scaling delivers these gains and show how output randomness in LLMs affects its effectiveness.
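Benchmarks like VerilogEval are typically scored with pass@k; the abstract does not name the metric, so as an assumed point of reference, the sketch below implements the standard unbiased pass@k estimator from the code-generation literature (Chen et al., 2021). The inputs in the usage loop are made-up numbers chosen only to show how sharply the metric rises with the parallel budget when per-sample success is low.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct.

    Equals 1 - C(n-c, k) / C(n, k), computed in a numerically stable form.
    """
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical illustration: with a 5% per-sample success rate
# (10 correct out of 200 draws), scaling k pays off quickly.
for k in (1, 10, 100):
    print(k, round(pass_at_k(n=200, c=10, k=k), 3))
```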