🤖 AI Summary
Current large language models (LLMs) lack standardized evaluation for hardware description language (HDL) code generation, particularly for synthesizable, functionally correct communication protocol implementations. Method: We introduce the first protocol-level RTL generation benchmark targeting SPI, I²C, UART, and AXI protocols, featuring multi-abstraction-level generation tasks and a rigorous synthesis-readiness validation pipeline—including syntax checking, logic synthesis, and UVM-driven waveform simulation. Contribution/Results: Evaluating 12 prominent open- and closed-weight LLMs, we find only two models pass all functional correctness checks, with an average synthesis success rate below 35%. Results reveal pervasive deficiencies in protocol-specific timing modeling and concurrent control handling. This benchmark fills a critical gap in evaluating LLMs for digital circuit protocol implementation and establishes a new evaluation paradigm for HDL code generation capability.
📝 Abstract
Recent advances in Large Language Models (LLMs) have shown promising capabilities in generating code for general-purpose programming languages. In contrast, their applicability for hardware description languages, particularly for generating synthesizable and functionally correct designs, remains significantly underexplored. HDLs such as SystemVerilog are logic-oriented and demand strict adherence to timing semantics, concurrency, and synthesizability constraints. Moreover, HDL-based design flows encompass a broad set of tasks beyond structural code generation, including testbench development, assertion-based verification, timing closure, and protocol-level integration for on-chip communication. The objective of our paper is to analyze the capabilities of state-of-the-art LLMs in generating SystemVerilog implementations of standard communication protocols, a core component of embedded and System-on-Chip (SoC) architectures. This paper introduces the first benchmark suite targeting four widely used protocols: SPI, I2C, UART, and AXI. We define code generation tasks that capture varying levels of design abstraction and prompt specificity. The generated designs are assessed for syntactic correctness, synthesizability, and functional fidelity via waveform simulation and test benches.