🤖 AI Summary
To address the limited visual navigation and object tracking capabilities of embodied agents in open-world environments, which stem from inadequate modeling of complex dynamic scenes, this paper introduces UnrealZoo, a collection of photorealistic, interactive 3D virtual worlds for embodied AI. It is built on Unreal Engine and the UnrealCV Python API, with UnrealCV's rendering and communication optimized for efficiency, and it supports multi-agent interaction, data collection, environment augmentation, distributed training, and benchmarking. The environments feature diverse terrains, realistic lighting, physically grounded interactions, and a variety of playable entities. Experiments on visual navigation and tracking show that diverse training environments benefit reinforcement learning (RL) agents, while current agents based on RL and on large vision-language models (VLMs) still struggle in open worlds. The study identifies two key bottlenecks: latency in closed-loop control in dynamic scenes, and reasoning about 3D spatial structures in unstructured terrain. The suite provides a benchmark for evaluating open-world embodied intelligence.
📝 Abstract
We introduce UnrealZoo, a rich collection of photo-realistic 3D virtual worlds built on Unreal Engine, designed to reflect the complexity and variability of open worlds. We also offer a variety of playable entities for embodied AI agents. Based on UnrealCV, we provide a suite of easy-to-use Python APIs and tools for applications such as data collection, environment augmentation, distributed training, and benchmarking. We optimize the rendering and communication efficiency of UnrealCV to support advanced applications, such as multi-agent interaction. Our experiments benchmark agents in a range of complex scenes, focusing on visual navigation and tracking, which are fundamental capabilities for embodied visual intelligence. The results yield insights into the advantages of diverse training environments for reinforcement learning (RL) agents and into the challenges that current embodied vision agents, including those based on RL and large vision-language models (VLMs), face in open worlds. These challenges include latency in closed-loop control in dynamic scenes and reasoning about 3D spatial structures in unstructured terrain.