Intent2Tx: Benchmarking LLMs for Translating Natural Language Intents into Ethereum Transactions

📅 2026-04-30

🤖 AI Summary
Existing benchmarks do not adequately evaluate how well large language models (LLMs) translate natural language intents into functionally correct, state-dependent Ethereum transactions. This work proposes Intent2Tx, a high-fidelity benchmark built from 300 days of real-world Ethereum mainnet traces. It covers both single-step and multi-step transactions across 11 categories, including long-tail DeFi primitives. The authors introduce an execution-aware evaluation framework that runs on forked mainnet environments and uses differential state analysis to verify transaction correctness beyond surface-level text matching. An evaluation of 16 LLMs shows that current models struggle with out-of-distribution generalization and multi-step planning, and that syntactically valid outputs often fail to achieve the intended state transitions.
📝 Abstract
The emergence of Large Language Models (LLMs) offers a transformative interface for Web3, yet existing benchmarks fail to capture the complexity of translating high-level user intents into functionally correct, state-dependent on-chain transactions. We present Intent2Tx, a high-fidelity benchmark featuring 29,921 single-step and 1,575 multi-step instances meticulously derived from 300 days of real-world Ethereum mainnet traces. Unlike prior works that rely on synthetic instructions, Intent2Tx grounds natural language intents in real-world protocol interactions across 11 categories, including diverse long-tail Decentralized Finance (DeFi) primitives. To enable rigorous evaluation, we propose an execution-aware framework that transcends surface-level text matching by employing differential state analysis on forked mainnet environments. Our extensive evaluation of 16 state-of-the-art LLMs reveals that while scaling and retrieval-augmentation enhance logical consistency and parameter precision, current models struggle with out-of-distribution generalization and multi-step planning. Crucially, our execution-based analysis demonstrates that syntactically valid outputs often fail to achieve intended state transitions, highlighting a significant gap in current "reasoning-to-execution" capabilities. Intent2Tx serves as a critical foundation for developing autonomous, reliable agents in intent-centric Web3 ecosystems. Code and data: https://anonymous.4open.science/r/Intent2Tx_Bench-97FF
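The idea of differential state analysis mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes that the state of a forked chain can be summarized as a flat mapping from a hypothetical "account:token" key to an integer balance, and every name here (`State`, `state_diff`, `same_effect`, the sample balances) is invented for illustration. The sketch compares the state change caused by a reference transaction with the change caused by a model-generated one, rather than comparing their text.

```python
from typing import Dict

# Hypothetical state snapshot: "account:token" -> integer balance.
State = Dict[str, int]


def state_diff(pre: State, post: State) -> Dict[str, int]:
    """Return the non-zero balance change for every key in either snapshot."""
    keys = set(pre) | set(post)
    deltas = {key: post.get(key, 0) - pre.get(key, 0) for key in keys}
    return {key: delta for key, delta in deltas.items() if delta != 0}


def same_effect(pre: State, post_ref: State, post_cand: State) -> bool:
    """True when the candidate causes the same state change as the reference."""
    return state_diff(pre, post_ref) == state_diff(pre, post_cand)


# Illustrative snapshots (made-up numbers, not benchmark data).
pre = {"alice:USDC": 100, "alice:WETH": 0}
post_ref = {"alice:USDC": 0, "alice:WETH": 5}    # reference: swap all USDC
post_cand = {"alice:USDC": 50, "alice:WETH": 2}  # candidate: swaps only half

print(state_diff(pre, post_ref))
print(same_effect(pre, post_ref, post_cand))
```

In this toy case the candidate transaction is well-formed and executes, yet its state delta differs from the reference, which is the kind of failure that text matching alone would not reveal. A real harness would read balances, allowances, and storage from a forked node instead of hand-written dictionaries.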
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Ethereum Transactions
Intent Translation
Web3
Benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Intent2Tx
execution-aware evaluation
Ethereum transaction synthesis
differential state analysis
LLM benchmarking