MMInA: Benchmarking Multihop Multimodal Internet Agents

📅 2024-04-15
🏛️ arXiv.org
📈 Citations: 22
Influential: 2
📄 PDF
🤖 AI Summary
Existing benchmarks inadequately evaluate autonomous agents' multi-hop, cross-site navigation capabilities in realistic, dynamically evolving multimodal web environments. To address this, the authors propose MMInA, a benchmark for evaluating internet agents on real-world, evolving multimodal websites, comprising 1,050 human-written compositional tasks spanning shopping, travel, and other domains. The work introduces a multihop, multimodal evaluation paradigm grounded in live websites; designs a hop-wise success metric to quantify task progression; and identifies early-hop failure as the primary bottleneck in long-horizon tasks. Leveraging this insight, the authors propose a memory augmentation technique that replays past action trajectories. Experiments show that state-of-the-art agents fail disproportionately on early hops, and that the replay approach significantly improves both single-hop and multihop success rates. The benchmark dataset and code are publicly released.

📝 Abstract
Autonomous embodied agents live on an Internet of multimedia websites. Can they hop around multimodal websites to complete complex user tasks? Existing benchmarks fail to assess them in a realistic, evolving environment for their embodiment across websites. To answer this question, we present MMInA, a multihop and multimodal benchmark to evaluate the embodied agents for compositional Internet tasks, with several appealing properties: 1) Evolving real-world multimodal websites. Our benchmark uniquely operates on evolving real-world websites, ensuring a high degree of realism and applicability to natural user tasks. Our data includes 1,050 human-written tasks covering various domains such as shopping and travel, with each task requiring the agent to autonomously extract multimodal information from web pages as observations; 2) Multihop web browsing. Our dataset features naturally compositional tasks that require information from or actions on multiple websites to solve, to assess long-range reasoning capabilities on web tasks; 3) Holistic evaluation. We propose a novel protocol for evaluating an agent's progress in completing multihop tasks. We experiment with both standalone (multimodal) language models and heuristic-based web agents. Extensive experiments demonstrate that while long-chain multihop web tasks are easy for humans, they remain challenging for state-of-the-art web agents. We identify that agents are more likely to fail on the early hops when solving tasks of more hops, which results in lower task success rates. To address this issue, we propose a simple memory augmentation approach replaying past action trajectories to reflect. Our method significantly improved both the single-hop and multihop web browsing abilities of agents. See our code and data at https://mmina.cliangyu.com
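The abstract's "holistic evaluation" protocol scores an agent's progress hop by hop rather than only on full-task completion. A minimal sketch of how such a hop-wise metric could be computed, assuming each task decomposes into an ordered list of per-hop outcomes (the names and structure here are illustrative assumptions, not the paper's actual implementation):

```python
# Hypothetical sketch of hop-wise evaluation: each task records which of its
# hops the agent completed; we report full-task success and per-hop success.
from dataclasses import dataclass


@dataclass
class TaskResult:
    hop_success: list  # hop_success[i] is True if the agent completed hop i


def task_success_rate(results):
    """Fraction of tasks where every hop succeeded (strict end-to-end metric)."""
    return sum(all(r.hop_success) for r in results) / len(results)


def hop_success_rate(results, hop_index):
    """Fraction of tasks (with at least hop_index+1 hops) that completed that hop."""
    eligible = [r for r in results if len(r.hop_success) > hop_index]
    if not eligible:
        return 0.0
    return sum(r.hop_success[hop_index] for r in eligible) / len(eligible)


results = [
    TaskResult([True, True]),    # fully solved 2-hop task
    TaskResult([True, False]),   # failed on the second hop
    TaskResult([False, False]),  # failed on the first hop
]
print(task_success_rate(results))    # prints 0.333... (1 of 3 tasks solved)
print(hop_success_rate(results, 0))  # prints 0.666... (2 of 3 clear hop 0)
```

Comparing `hop_success_rate` across hop indices is what surfaces the paper's observed pattern: failures concentrate on early hops as task length grows.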
Problem

Research questions and friction points this paper is trying to address.

Evaluating autonomous agents on evolving real-world multimodal websites
Assessing multihop web browsing for compositional Internet tasks
Improving agent performance in long-chain multihop tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evolving real-world multimodal websites evaluation
Multihop web browsing for long-range reasoning
Memory augmentation to replay past actions
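The memory augmentation replays the agent's past action trajectories so it can reflect on earlier hops before acting on the next one. A minimal sketch of one way to structure such a replay memory, assuming trajectories are rendered as text and prepended to the agent's prompt (class and method names are assumptions for illustration, not the paper's code):

```python
# Hypothetical trajectory-replay memory: record each completed hop's actions,
# then render them as context the agent conditions on for the next hop.
class ReplayMemory:
    def __init__(self):
        self.trajectories = []  # one entry per completed hop

    def record(self, hop_goal, actions):
        """Store the goal and action sequence of a finished hop."""
        self.trajectories.append({"goal": hop_goal, "actions": list(actions)})

    def as_context(self):
        """Render past hops as plain text for the agent's prompt."""
        lines = []
        for i, traj in enumerate(self.trajectories):
            acts = "; ".join(traj["actions"])
            lines.append(f"Hop {i + 1} ({traj['goal']}): {acts}")
        return "\n".join(lines)


memory = ReplayMemory()
memory.record("find a flight price", ["search('SIN to NRT')", "click('cheapest fare')"])
prompt = (
    "Past trajectory:\n" + memory.as_context()
    + "\nNext task: book a hotel near the arrival airport."
)
```

The design choice is that replayed trajectories stay in the prompt across hops, giving the model a chance to catch and correct early-hop mistakes that would otherwise doom the whole multihop task.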