🤖 AI Summary
Existing benchmarks do not adequately evaluate how well large language models (LLMs) translate natural language intents into functionally correct, state-dependent Ethereum transactions. This work proposes Intent2Tx, described as the first high-fidelity benchmark constructed from 300 days of real-world Ethereum mainnet traces. It covers both single-step and multi-step transactions across 11 categories, including long-tail DeFi scenarios. The authors introduce an execution-aware evaluation framework that runs on forked chains and uses differential state analysis to verify transaction correctness rather than relying on surface-level text matching. An evaluation of 16 prominent LLMs reveals significant shortcomings in out-of-distribution generalization and multi-step planning, with syntactically valid outputs frequently failing to achieve the intended state transitions.
📝 Abstract
The emergence of Large Language Models (LLMs) offers a transformative interface for Web3, yet existing benchmarks fail to capture the complexity of translating high-level user intents into functionally correct, state-dependent on-chain transactions. We present Intent2Tx, a high-fidelity benchmark featuring 29,921 single-step and 1,575 multi-step instances derived from 300 days of real-world Ethereum mainnet traces. Unlike prior work that relies on synthetic instructions, Intent2Tx grounds natural language intents in real-world protocol interactions across 11 categories, including diverse long-tail Decentralized Finance (DeFi) primitives. To enable rigorous evaluation, we propose an execution-aware framework that goes beyond surface-level text matching by applying differential state analysis on forked mainnet environments. Our evaluation of 16 state-of-the-art LLMs reveals that while scaling and retrieval augmentation improve logical consistency and parameter precision, current models struggle with out-of-distribution generalization and multi-step planning. Crucially, our execution-based analysis demonstrates that syntactically valid outputs often fail to achieve the intended state transitions, highlighting a significant gap in current "reasoning-to-execution" capabilities. Intent2Tx serves as a foundation for developing autonomous, reliable agents in intent-centric Web3 ecosystems. Code and data: https://anonymous.4open.science/r/Intent2Tx_Bench-97FF
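To make the idea of differential state analysis concrete, here is a minimal sketch in Python. It is not the authors' framework: the snapshot format (a dictionary from a hypothetical "account:token" key to an integer balance), the function names `state_diff` and `same_effect`, and the `tolerance` parameter are all assumptions made for illustration. In a real setup the snapshots would presumably be read from a forked chain before and after executing each transaction; here they are plain dictionaries so the example runs on its own.

```python
from typing import Dict

# Hypothetical snapshot format: "account:token" -> balance in base units.
State = Dict[str, int]


def state_diff(before: State, after: State) -> Dict[str, int]:
    """Return the nonzero balance deltas between two snapshots."""
    keys = set(before) | set(after)
    deltas = {k: after.get(k, 0) - before.get(k, 0) for k in keys}
    return {k: d for k, d in deltas.items() if d != 0}


def same_effect(before: State, after_reference: State,
                after_candidate: State, tolerance: int = 0) -> bool:
    """Check whether a candidate transaction changed the same state entries
    as the reference transaction, within an absolute tolerance per entry."""
    ref = state_diff(before, after_reference)
    cand = state_diff(before, after_candidate)
    if set(ref) != set(cand):
        return False  # the candidate touched different entries, or none
    return all(abs(ref[k] - cand[k]) <= tolerance for k in ref)


# Illustrative data only: a swap of 1 unit of ETH for 3000 units of USDC.
before = {"alice:ETH": 5, "alice:USDC": 0}
after_reference = {"alice:ETH": 4, "alice:USDC": 3000}

# A well-formed candidate that executes but changes nothing, for example
# a call that never performs the swap.
after_noop = dict(before)

# A candidate that achieves a slightly different output amount.
after_close = {"alice:ETH": 4, "alice:USDC": 2995}

matches_noop = same_effect(before, after_reference, after_noop)
matches_close = same_effect(before, after_reference, after_close, tolerance=10)
```

In this sketch the no-op candidate is rejected even though it could be perfectly well-formed, which mirrors the abstract's point that syntactic validity does not imply the intended state transition. The tolerance is one possible design choice for absorbing small differences such as slippage; whether the benchmark does anything similar is not stated in the abstract.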