Intent2Tx: Benchmarking LLMs for Translating Natural Language Intents into Ethereum Transactions

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks do not adequately evaluate how well large language models (LLMs) translate natural language intents into functionally correct, state-dependent Ethereum transactions. This work proposes Intent2Tx, a high-fidelity benchmark built from 300 days of real-world Ethereum mainnet traces. It covers both single-step and multi-step transactions across 11 categories, including long-tail DeFi primitives. The authors introduce an execution-aware evaluation framework that runs transactions on forked mainnet environments and uses differential state analysis to verify correctness, rather than relying on surface-level text matching. An evaluation of 16 LLMs shows that current models struggle with out-of-distribution generalization and multi-step planning, and that syntactically valid outputs often fail to achieve the intended state transitions.

📝 Abstract

The emergence of Large Language Models (LLMs) offers a transformative interface for Web3, yet existing benchmarks fail to capture the complexity of translating high-level user intents into functionally correct, state-dependent on-chain transactions. We present Intent2Tx, a high-fidelity benchmark featuring 29,921 single-step and 1,575 multi-step instances meticulously derived from 300 days of real-world Ethereum mainnet traces. Unlike prior works that rely on synthetic instructions, Intent2Tx grounds natural language intents in real-world protocol interactions across 11 categories, including diverse long-tail Decentralized Finance (DeFi) primitives. To enable rigorous evaluation, we propose an execution-aware framework that transcends surface-level text matching by employing differential state analysis on forked mainnet environments. Our extensive evaluation of 16 state-of-the-art LLMs reveals that while scaling and retrieval-augmentation enhance logical consistency and parameter precision, current models struggle with out-of-distribution generalization and multi-step planning. Crucially, our execution-based analysis demonstrates that syntactically valid outputs often fail to achieve intended state transitions, highlighting a significant gap in current "reasoning-to-execution" capabilities. Intent2Tx serves as a critical foundation for developing autonomous, reliable agents in intent-centric Web3 ecosystems. Code and data: https://anonymous.4open.science/r/Intent2Tx_Bench-97FF
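
To make the idea of differential state analysis concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical, simplified state model in which a snapshot maps an (account, asset) pair to a balance; the names `State`, `state_diff`, and `achieves_same_transition` are illustrative and do not come from the paper. In practice the snapshots would be taken on a forked chain before and after executing a transaction, which this sketch replaces with hand-written dictionaries.

```python
# Hypothetical simplified state: (account, asset) -> balance in base units.
# A real evaluation would read such values from a forked chain instead.
State = dict[tuple[str, str], int]


def state_diff(pre: State, post: State) -> dict[tuple[str, str], int]:
    """Return the non-zero balance changes between two snapshots."""
    diff = {}
    for key in pre.keys() | post.keys():
        delta = post.get(key, 0) - pre.get(key, 0)
        if delta != 0:
            diff[key] = delta
    return diff


def achieves_same_transition(pre: State, post_reference: State,
                             post_candidate: State) -> bool:
    """Compare the candidate's state changes with the reference's changes."""
    return state_diff(pre, post_reference) == state_diff(pre, post_candidate)


# Illustrative intent: "send 25 USDC from alice to bob".
pre = {("alice", "USDC"): 100, ("bob", "USDC"): 0}

# State after the reference (ground-truth) transaction.
post_reference = {("alice", "USDC"): 75, ("bob", "USDC"): 25}

# State after a well-formed candidate that pays the wrong recipient.
post_wrong_recipient = {
    ("alice", "USDC"): 75,
    ("bob", "USDC"): 0,
    ("carol", "USDC"): 25,
}

print(achieves_same_transition(pre, post_reference, post_reference))
print(achieves_same_transition(pre, post_reference, post_wrong_recipient))
```

The second candidate is a valid transfer in form, yet its state changes differ from the reference, which is the kind of failure that text matching alone would not necessarily expose.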

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Ethereum Transactions

Intent Translation

Web3

Benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

Intent2Tx

execution-aware evaluation

Ethereum transaction synthesis