Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence

📅 2025-06-18
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Current AI agents typically decouple embodied interaction from web-based knowledge acquisition, limiting their ability to jointly reason across physical and digital domains for complex tasks—e.g., cooking via online recipe search or dynamic map-based navigation. To address this, we propose the first unified agent paradigm that deeply integrates embodiment and web reasoning. We introduce a comprehensive simulation platform featuring 3D indoor/outdoor environments, interactive web interfaces, and multi-step embodied planning, alongside a multi-task benchmark covering cooking, geolocation, and more. We further define systematic, cross-modal evaluation metrics to quantify cross-domain collaborative reasoning. Experiments reveal substantial performance gaps between state-of-the-art models and human-level reasoning in such hybrid settings. All components—including environments, datasets, code, and evaluation tools—are fully open-sourced, establishing a foundational resource for embodied-web intelligence research.
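The summary above describes an agent that must interleave web operations (searching recipes, reading maps) with embodied actions (navigating, manipulating objects) inside one task. As a rough mental model only, the sketch below shows what such an interleaved control loop could look like; the names env, browser, llm, and action.domain are illustrative assumptions for exposition, not the platform's actual interfaces, which live in the open-sourced code.

    # Hypothetical sketch (not the released API): a single agent that alternates
    # between acting in a simulated 3D environment and operating a web interface.
    # env, browser, and llm stand in for the platform's 3D scene, functional web
    # pages, and language-model policy, respectively.
    def run_embodied_web_agent(env, browser, llm, task, max_steps=50):
        observation = env.reset(task)              # initial egocentric view of the scene
        web_state = browser.open(task.start_url)   # e.g., a recipe, map, or shopping page
        history = []

        for _ in range(max_steps):
            # The policy conditions on both the physical observation and the current
            # web page, then chooses which domain to act in next.
            action = llm.decide(task, observation, web_state, history)
            history.append(action)

            if action.domain == "web":             # search, click, scroll, read
                web_state = browser.execute(action)
            elif action.domain == "embodied":      # navigate, pick up, cook, place
                observation = env.step(action)
            else:                                  # "stop": agent believes the task is done
                break

        return env.evaluate(task), history         # cross-domain task success

The point of the sketch is the single shared loop: information read on the web (an ingredient list, a route) can directly condition the next physical action, and physical observations can in turn trigger new web queries.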

📝 Abstract
AI agents today are mostly siloed - they either retrieve and reason over vast amounts of digital information and knowledge obtained online, or interact with the physical world through embodied perception, planning and action - but rarely both. This separation limits their ability to solve tasks that require integrated physical and digital intelligence, such as cooking from online recipes, navigating with dynamic map data, or interpreting real-world landmarks using web knowledge. We introduce Embodied Web Agents, a novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. To operationalize this concept, we first develop the Embodied Web Agents task environments, a unified simulation platform that tightly integrates realistic 3D indoor and outdoor environments with functional web interfaces. Building upon this platform, we construct and release the Embodied Web Agents Benchmark, which encompasses a diverse suite of tasks including cooking, navigation, shopping, tourism, and geolocation - all requiring coordinated reasoning across physical and digital realms for systematic assessment of cross-domain intelligence. Experimental results reveal significant performance gaps between state-of-the-art AI systems and human capabilities, establishing both challenges and opportunities at the intersection of embodied cognition and web-scale knowledge access. All datasets, code, and websites are publicly available at our project page https://embodied-web-agent.github.io/.
Problem

Research questions and friction points this paper is trying to address.

AI agents lack integration between digital and physical intelligence
Existing benchmarks rarely pose tasks that require combined web and real-world reasoning
No unified platform exists for assessing cross-domain agent intelligence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified simulation platform tightly integrates 3D indoor/outdoor environments with functional web interfaces
Benchmark spans cooking, navigation, shopping, tourism, and geolocation tasks
Tasks demand coordinated physical action and digital reasoning (see the illustrative task sketch after this list)
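To make the cross-domain requirement concrete, here is a purely illustrative example of how one benchmark task instance could be specified. The field names, URL, and scene identifier are invented for exposition and are not taken from the released dataset.

    # Illustrative only: field names, URL, and scene id are assumptions, not the benchmark schema.
    cooking_task = {
        "task_id": "cooking_0001",
        "instruction": "Find a shakshuka recipe online and prepare it in the kitchen.",
        "web": {
            "start_url": "https://example.com/recipes",      # placeholder recipe site
            "required_info": ["ingredient list", "ordered cooking steps"],
        },
        "embodied": {
            "scene": "indoor_kitchen_03",
            "goal_conditions": [
                {"object": "pan", "state": "on_stove"},
                {"object": "eggs", "state": "cooked"},
            ],
        },
        "evaluation": {
            "success": "all goal_conditions satisfied using the retrieved recipe",
            "metrics": ["task success rate", "web-grounding accuracy", "steps taken"],
        },
    }

A task defined this way can only be solved by visiting the web side (to recover the ingredient list and steps) and then grounding that information in the physical scene, which is exactly the coordination the benchmark is built to measure.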
Authors

Yining Hong (Stanford) · Computer Science, Embodied AI, Computer Vision, Natural Language Processing
Rui Sun (University of California, Los Angeles)
Bingxuan Li (UIUC)
Xingcheng Yao (Moonshot AI)
Maxine Wu (University of California, Los Angeles)
Alexander Chien (University of California, Los Angeles)
Da Yin (Meta FAIR) · Natural Language Processing
Ying Nian Wu (UCLA Department of Statistics and Data Science) · Generative AI, Representation Learning, Computer Vision, Computational Neuroscience, Bioinformatics
Zhecan Wang (University of California, Los Angeles)
Kai-Wei Chang (University of California, Los Angeles)