EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation

📅 2026-01-10

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models that generate on-chain transaction code are rarely evaluated against actual execution outcomes, so their accuracy and safety remain poorly measured even though a single error can cause irreversible loss. This work introduces EVM-QuestBench, an execution-grounded benchmark for natural-language-to-transaction-script generation on EVM-compatible blockchains. It supports safe, realistic assessment of both atomic and composite tasks through dynamic sampling of instructions and parameters, execution on a forked chain with snapshot isolation, and step-efficiency decay for composite tasks. Across 107 tasks (62 atomic, 45 composite) and 20 evaluated models, the study finds large performance gaps and a persistent asymmetry between single-action precision and multi-step workflow completion, providing an evaluation framework for research on transaction-script generation.

📝 Abstract

Large language models are increasingly applied to various development scenarios. However, in on-chain transaction scenarios, even a minor error can cause irreversible loss for users. Existing evaluations often overlook execution accuracy and safety. We introduce EVM-QuestBench, an execution-grounded benchmark for natural-language transaction-script generation on EVM-compatible chains. The benchmark employs dynamic evaluation: instructions are sampled from template pools, numeric parameters are drawn from predefined intervals, and validators verify outcomes against these instantiated values. EVM-QuestBench contains 107 tasks (62 atomic, 45 composite). Its modular architecture enables rapid task development. The runner executes scripts on a forked EVM chain with snapshot isolation; composite tasks apply step-efficiency decay. We evaluate 20 models and find large performance gaps, with split scores revealing persistent asymmetry between single-action precision and multi-step workflow completion. Code: https://anonymous.4open.science/r/bsc_quest_bench-A9CF/.
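The dynamic evaluation described in the abstract can be illustrated with a minimal sketch. This is not the benchmark's actual code: the class names, the template strings, the parameter interval, and the transfer-style validator are all hypothetical, chosen only to show how an instruction and a numeric parameter might be sampled together and how a validator might then check the outcome against the sampled value rather than a fixed constant.

```python
import random
from dataclasses import dataclass
from typing import Tuple


# Hypothetical task definition; field names are illustrative, not taken from EVM-QuestBench.
@dataclass(frozen=True)
class TaskTemplate:
    templates: Tuple[str, ...]      # pool of phrasings for the same task
    amount_range: Tuple[int, int]   # inclusive interval, in the token's smallest unit


@dataclass(frozen=True)
class TaskInstance:
    instruction: str  # natural-language prompt shown to the model
    amount: int       # sampled parameter the validator checks against


def instantiate(task: TaskTemplate, rng: random.Random) -> TaskInstance:
    """Sample one phrasing and one numeric parameter for this run."""
    amount = rng.randint(*task.amount_range)
    template = rng.choice(task.templates)
    return TaskInstance(instruction=template.format(amount=amount), amount=amount)


def validate_transfer(instance: TaskInstance, balance_before: int, balance_after: int) -> bool:
    """Pass only if the recipient's balance changed by exactly the sampled amount."""
    return balance_after - balance_before == instance.amount


# Assumed example task: a simple transfer with two alternative phrasings.
TRANSFER_TASK = TaskTemplate(
    templates=(
        "Send {amount} wei to the recipient address.",
        "Transfer exactly {amount} wei to the recipient.",
    ),
    amount_range=(1_000, 1_000_000),
)

if __name__ == "__main__":
    instance = instantiate(TRANSFER_TASK, random.Random(0))
    print(instance.instruction)
    # Balances would come from the forked chain in a real runner; here they are made up.
    print(validate_transfer(instance, 5_000, 5_000 + instance.amount))
```

Because the validator compares against the instantiated amount, a script that hard-codes a value seen in an earlier run would fail on a fresh sample, which is presumably the point of sampling parameters per run.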

Problem

Research questions and friction points this paper is trying to address.

execution accuracy

safety

natural-language transaction code

EVM-compatible chains

benchmark evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

execution-grounded benchmark

EVM-compatible chains

dynamic evaluation