🤖 AI Summary
This work addresses the challenge of automatically generating executable smart contract code from natural-language business process descriptions using large language models (LLMs), with strict requirements on semantic consistency for process control, resource allocation, and data-driven conditions.
Method: We propose an end-to-end code generation and verification pipeline spanning multiple LLM scales and architectures, coupled with the first automated evaluation framework tailored to the functional properties of smart contracts. The framework systematically assesses execution reliability on a large-scale process modeling dataset, moving beyond manual inspection and syntactic compilation checks.
Contribution/Results: Experiments reveal that current LLMs fall significantly short of the production-grade reliability required for generated contracts. The study underscores the need for responsibility-aware integration, combining LLMs with formal verification tools and workflow engines. Our work establishes the first reproducible, multi-dimensional empirical benchmark for trustworthy AI-assisted smart contract development.
📝 Abstract
Large language models (LLMs) have changed how software is produced. Within the wider software engineering community, they are explored, among many other purposes, for generating code from different types of input. In this work, we present an exploratory study investigating the use of LLMs to generate smart contract code from business process descriptions, an idea that has emerged in recent literature to overcome the limitations of traditional rule-based code generation approaches. However, current LLM-based work evaluates generated code only on small samples, relying on manual inspection or checking whether the code compiles while ignoring whether it executes correctly. With this work, we introduce an automated evaluation framework and provide empirical data from larger data sets of process models. We test LLMs of different types and sizes on their ability to achieve important properties of process execution, including enforcing process flow, resource allocation, and data-based conditions. Our results show that LLM performance falls short of the perfect reliability required for smart contract development. We suggest that future work explore responsible integration of LLMs into existing tools for code generation to ensure more reliable output. Our benchmarking framework can serve as a foundation for developing and evaluating such integrations.