SimWorld-Robotics: Synthesizing Photorealistic and Dynamic Urban Environments for Multimodal Robot Navigation and Collaboration

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing embodied AI research is largely confined to indoor environments, lacking high-fidelity, dynamic, multimodal simulation platforms for open urban settings. This paper introduces the first large-scale, ray-traced urban simulation framework built on Unreal Engine 5, enabling procedurally generated photorealistic cityscapes, dynamic pedestrian and traffic flow modeling, multi-robot ROS/ROS2 co-control, and embodied communication. We establish two benchmark tasks: multimodal instruction-driven long-horizon safe navigation and multi-agent collaborative search. Our framework uniquely integrates vision-language understanding, 3D spatial reasoning, human-vehicle shared-autonomy safety planning, and distributed collaborative communication—constituting the first comprehensive evaluation suite for city-scale embodied intelligence. Experiments expose systematic limitations of state-of-the-art vision-language models in perceptual robustness, long-horizon planning, and collaborative intent modeling, thereby providing a new training environment and evaluation standard for general embodied AI.

📝 Abstract
Recent advances in foundation models have shown promising results in developing generalist robots that can perform diverse tasks in open-ended scenarios given multimodal inputs. However, current work has mainly focused on indoor, household scenarios. In this work, we present SimWorld-Robotics (SWR), a simulation platform for embodied AI in large-scale, photorealistic urban environments. Built on Unreal Engine 5, SWR procedurally generates unlimited photorealistic urban scenes populated with dynamic elements such as pedestrians and traffic systems, surpassing prior urban simulations in realism, complexity, and scalability. It also supports multi-robot control and communication. With these key features, we build two challenging robot benchmarks: (1) a multimodal instruction-following task, where a robot must follow vision-language navigation instructions to reach a destination in the presence of pedestrians and traffic; and (2) a multi-agent search task, where two robots must communicate to cooperatively locate and meet each other. Unlike existing benchmarks, these two new benchmarks comprehensively evaluate a wide range of critical robot capabilities in realistic scenarios, including (1) multimodal instruction grounding, (2) 3D spatial reasoning in large environments, (3) safe, long-range navigation among people and traffic, (4) multi-robot collaboration, and (5) grounded communication. Our experimental results demonstrate that state-of-the-art models, including vision-language models (VLMs), struggle with our tasks, lacking the robust perception, reasoning, and planning abilities necessary for urban environments.
Problem

Research questions and friction points this paper is trying to address.

Develops a photorealistic urban simulation for robot navigation and collaboration
Creates benchmarks to test multimodal instruction following in dynamic environments
Evaluates robot capacities for safe navigation and multi-agent cooperation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Procedurally generates unlimited photorealistic urban scenes
Supports multi-robot control and communication features
Builds challenging multimodal robot navigation benchmarks
Authors
Yan Zhuang (University of Virginia)
Jiawei Ren (NVIDIA) - Computer Vision, Machine Learning, Computer Graphics
Xiaokang Ye (UC San Diego)
Jianzhi Shen (Johns Hopkins University)
Ruixuan Zhang (Johns Hopkins University)
Tianai Yue (Johns Hopkins University)
Muhammad Faayez (Johns Hopkins University)
Xuhong He (Carnegie Mellon University)
Xiyan Zhang (Johns Hopkins University)
Ziqiao Ma (University of Michigan) - Machine Learning, Computational Linguistics
Lianhui Qin (UC San Diego, Computer Science and Engineering) - Natural Language Processing, Machine Learning
Zhiting Hu (Assistant Professor at UC San Diego) - Machine Learning, Artificial Intelligence, Natural Language Processing
Tianmin Shu (Assistant Professor, JHU) - Artificial Intelligence, Cognitive Science