Pluto: A Benchmark for Evaluating Efficiency of LLM-generated Hardware Code

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-based hardware code generation benchmarks evaluate only functional correctness, lacking systematic assessment of synthesizable efficiency metrics—such as area, delay, and power—and omitting optimization baselines and self-verifying test environments. Method: We introduce Pluto, a comprehensive benchmark comprising 114 design tasks with self-checking testbenches, integrated Pareto-optimal reference implementations, and a multi-dimensional evaluation framework targeting area, delay, and power efficiency. Pluto enables end-to-end synthesis verification and quantitative comparison of LLM-generated RTL code. Results: Experiments reveal that state-of-the-art models achieve only a 78.3% functional pass rate (pass@1), while their area, delay, and power efficiency stand at merely 63.8%, 65.9%, and 64.0% (eff@1), respectively—highlighting critical deficiencies in hardware efficiency. Pluto establishes the first reproducible, scalable evaluation infrastructure for hardware-aware LLM research.

📝 Abstract
Large Language Models (LLMs) are increasingly used to automate hardware design tasks, including the generation of Verilog code. While early benchmarks focus primarily on functional correctness, efficient hardware design demands additional optimization for synthesis metrics such as area, delay, and power. Existing benchmarks fall short in evaluating these aspects comprehensively: they often lack optimized baselines or testbenches for verification. To address these gaps, we present Pluto, a benchmark and evaluation framework designed to assess the efficiency of LLM-generated Verilog designs. Pluto presents a comprehensive evaluation set of 114 problems with self-checking testbenches and multiple Pareto-optimal reference implementations. Experimental results show that state-of-the-art LLMs can achieve high functional correctness, reaching 78.3% at pass@1, but their synthesis efficiency still lags behind expert-crafted implementations, with area efficiency of 63.8%, delay efficiency of 65.9%, and power efficiency of 64.0% at eff@1. This highlights the need for efficiency-aware evaluation frameworks such as Pluto to drive progress in hardware-focused LLM research.
Problem

Research questions and friction points this paper is trying to address.

Evaluating synthesis efficiency of LLM-generated Verilog code
Addressing gaps in existing hardware design benchmarks
Assessing area, delay, and power optimization in generated code
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pluto benchmark evaluates LLM-generated hardware code efficiency
Includes 114 problems with self-checking testbenches for verification
Provides Pareto-optimal reference implementations for synthesis optimization
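The headline metrics can be sketched in a few lines. Note this is an illustrative assumption, not Pluto's published definition: pass@1 is shown as the fraction of problems whose single generated design passes its testbench, and per-design efficiency as the ratio of a Pareto-optimal reference metric to the generated design's metric (capped at 1.0).

```python
# Hedged sketch of pass@1 and a per-metric efficiency score; the exact
# eff@1 definition is Pluto's own, so the ratio below is an assumption.

def pass_at_1(results: dict[str, bool]) -> float:
    """Fraction of problems whose single generated design passes its
    self-checking testbench. `results` maps problem id -> pass/fail."""
    return sum(results.values()) / len(results)

def efficiency(generated: float, reference: float) -> float:
    """Efficiency of a functionally correct design relative to a
    Pareto-optimal reference, for a metric where lower is better
    (area, delay, or power). Capped at 1.0 so beating the reference
    does not inflate the score."""
    return min(reference / generated, 1.0)

# Hypothetical example: two of three designs pass; one correct design
# uses 120 area units where the expert reference uses 90.
results = {"adder": True, "fifo": True, "alu": False}
print(pass_at_1(results))                          # 2/3 of designs pass
print(efficiency(generated=120.0, reference=90.0)) # area efficiency 0.75
```

Under this sketch, eff@1 for the benchmark would be the average of such per-design efficiency scores across all problems, computed separately for area, delay, and power.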
Manar Abdelatty
Department of Electrical and Computer Engineering, Brown University, Providence, RI 02906, USA
Maryam Nouh
Department of Electrical and Computer Engineering, Brown University, Providence, RI 02906, USA
Jacob K. Rosenstein
Department of Electrical and Computer Engineering, Brown University, Providence, RI 02906, USA
Sherief Reda
Professor, Brown University | Amazon Scholar | IEEE Fellow
Energy-Efficient Computing · Design Automation · Embedded Systems · Molecular Computing