AI Summary
This work investigates the capability of large language models (LLMs) to recover finite-state machine (FSM) behavior from natural language descriptions and generate correct register-transfer level (RTL) code. To this end, we introduce the first fully automated and scalable FSM-to-RTL benchmark, comprising over a thousand test cases, with data quality ensured through a structured YAML intermediate representation, formal verification via SAT solvers, and manual validation. Our experiments demonstrate that supervised fine-tuning substantially improves out-of-distribution generalization, while test-time compute scaling enhances reasoning reliability. Nevertheless, even the most advanced LLMs exhibit a significant drop in accuracy on complex FSM tasks, revealing fundamental limitations in current models' ability to reason about hardware semantics.
Abstract
Finite-state reasoning, the ability to understand and implement state-dependent behavior, is central to hardware design. In this paper, we present LLM-FSM, a benchmark that evaluates how well large language models (LLMs) can recover finite-state machine (FSM) behavior from natural-language specifications and translate it into correct register-transfer level (RTL) implementations. Unlike prior specification-to-RTL benchmarks that rely on manually constructed examples, LLM-FSM is built through a fully automated pipeline. LLM-FSM first constructs FSMs with configurable state counts and constrained transition structures. It then prompts LLMs to express each FSM in a structured YAML format with an application context, and to further convert that YAML into a natural-language (NL) specification. From the same YAML, our pipeline synthesizes the reference RTL and testbench in a correct-by-construction manner. All 1,000 problems are verified using LLM-based and SAT-solver-based checks, with human review on a subset. Our experiments show that even the strongest LLMs exhibit sharply declining accuracy as FSM complexity increases. We further demonstrate that training-time scaling via supervised fine-tuning (SFT) generalizes effectively to out-of-distribution (OOD) tasks, while increasing test-time compute improves reasoning reliability. Finally, LLM-FSM remains extensible by allowing its FSM complexity to scale with future model capabilities.
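To make the pipeline concrete, the sketch below shows what a structured FSM intermediate representation and a reference simulator might look like. The field names (`states`, `reset_state`, `transitions`, `when`) and the three-state example are purely illustrative assumptions, not LLM-FSM's actual YAML schema; the paper's real pipeline emits RTL and a testbench from such a representation rather than simulating it in Python.

```python
# Hypothetical structured FSM spec (illustrative field names, not LLM-FSM's schema).
fsm = {
    "states": ["IDLE", "RUN", "DONE"],
    "reset_state": "IDLE",
    "inputs": ["start", "finish"],
    "transitions": [
        {"from": "IDLE", "when": {"start": 1},  "to": "RUN"},
        {"from": "RUN",  "when": {"finish": 1}, "to": "DONE"},
        {"from": "DONE", "when": {},            "to": "IDLE"},  # unconditional
    ],
}

def step(state, inputs):
    """Return the next state; hold the current state if no condition matches."""
    for t in fsm["transitions"]:
        if t["from"] == state and all(inputs.get(k) == v for k, v in t["when"].items()):
            return t["to"]
    return state

# Drive the FSM through one input vector per cycle, as a testbench would.
state = fsm["reset_state"]
trace = [state]
for vec in [{"start": 1, "finish": 0}, {"start": 0, "finish": 1}, {}]:
    state = step(state, vec)
    trace.append(state)
print(trace)  # ['IDLE', 'RUN', 'DONE', 'IDLE']
```

A correct-by-construction flow would generate the RTL case statement and the stimulus from the same dictionary, so the reference implementation and testbench cannot disagree with the spec; an equivalence check (e.g. SAT-based, as in the paper) then only needs to compare a candidate design against this single source of truth.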