🤖 AI Summary
Existing LLM-based HDL generation benchmarks evaluate only functional correctness, neglecting critical FPGA constraints—particularly hardware resource efficiency (e.g., LUT utilization)—and suffer from narrow scenario coverage, limiting their ability to distinguish models’ resource optimization capabilities.
Method: We propose ResBench, the first resource-efficiency-oriented benchmark for LLM-generated HDL: it comprises 56 real-world FPGA design problems across 12 application categories; introduces LUT, FF, and BRAM utilization as primary evaluation metrics; and provides a scalable, resource-aware evaluation framework that integrates Xilinx Vivado synthesis and implementation flows with an automated comparison pipeline.
Results: Experiments reveal substantial variation in LUT usage among state-of-the-art LLMs—up to 3.2×—demonstrating the benchmark’s strong discriminative power and practical utility for assessing and advancing resource-aware HDL generation.
📝 Abstract
Field-Programmable Gate Arrays (FPGAs) are widely used in modern hardware design, yet writing Hardware Description Language (HDL) code for FPGA implementation remains labor-intensive and complex. Large Language Models (LLMs) have emerged as a promising tool for automating HDL generation, but existing benchmarks for LLM-based HDL code generation primarily evaluate functional correctness while overlooking the critical aspect of hardware resource efficiency. Moreover, current benchmarks lack diversity, failing to capture the broad range of real-world FPGA applications. To address these gaps, we introduce ResBench, the first resource-oriented benchmark explicitly designed to differentiate between resource-optimized and inefficient LLM-generated HDL. ResBench consists of 56 problems across 12 categories, covering applications from finite state machines to financial computing. Our evaluation framework systematically integrates FPGA resource constraints, with a primary focus on Lookup Table (LUT) usage, enabling a realistic assessment of hardware efficiency. Experimental results reveal substantial differences in resource utilization across LLMs, demonstrating ResBench's effectiveness in distinguishing models based on their ability to generate resource-optimized FPGA designs.
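To make the resource-aware evaluation concrete: a pipeline like the one described would synthesize each LLM-generated design in Vivado and then pull the LUT, FF, and BRAM counts out of the `report_utilization` output. The snippet below is a minimal, hypothetical sketch of that extraction step, not ResBench's actual implementation; the sample report text and the exact row names (`Slice LUTs`, `Slice Registers`, `Block RAM Tile`) are assumptions that vary by device family.

```python
import re

# Hypothetical excerpt of a Vivado report_utilization text table.
# Real reports differ by device family; row names here are assumed.
SAMPLE_REPORT = """
+-----------------+------+-------+-----------+-------+
|    Site Type    | Used | Fixed | Available | Util% |
+-----------------+------+-------+-----------+-------+
| Slice LUTs      | 1421 |     0 |     53200 |  2.67 |
| Slice Registers | 2050 |     0 |    106400 |  1.93 |
| Block RAM Tile  |    4 |     0 |       140 |  2.86 |
+-----------------+------+-------+-----------+-------+
"""

def parse_utilization(report: str) -> dict:
    """Extract the 'Used' column for the LUT, FF, and BRAM rows
    of a Vivado-style utilization table."""
    patterns = {
        "LUT": r"Slice LUTs\s*\|\s*(\d+)",
        "FF": r"Slice Registers\s*\|\s*(\d+)",
        "BRAM": r"Block RAM Tile\s*\|\s*(\d+)",
    }
    return {
        name: int(match.group(1))
        for name, pattern in patterns.items()
        if (match := re.search(pattern, report))
    }

print(parse_utilization(SAMPLE_REPORT))
# → {'LUT': 1421, 'FF': 2050, 'BRAM': 4}
```

Comparing these dictionaries across models for the same problem is what yields the per-design resource gaps (e.g., the up-to-3.2× LUT spread reported above).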