🤖 AI Summary
Existing visuomotor agents generalize poorly in 3D environments; reinforcement learning (RL) approaches often overfit to specific tasks, hindering zero-shot cross-scene transfer.
Method: We propose Cross-View Goal Specification (CVGS) as a unified multi-task representation framework and introduce an automated task synthesis mechanism grounded in Minecraft to overcome manual task design bottlenecks. Our approach integrates RL fine-tuning, cross-view goal embedding, and distributed RL training for large-scale joint multi-task optimization.
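The core idea of cross-view goal specification is that a task is given to the policy as an image of the goal rendered from a *different* camera view than the agent's own first-person observation; the policy must relate the two views spatially. A minimal toy sketch of that interface is below. All names, shapes, and the fixed random projections (standing in for learned visual encoders) are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

class CrossViewGoalPolicy:
    """Toy goal-conditioned policy: the observation (first-person view)
    and the goal image (third-person / cross view) are mapped into a
    shared embedding space, concatenated, and fed to an action head.
    Hypothetical sketch, not the paper's model."""

    def __init__(self, img_pixels: int, embed_dim: int = 16,
                 n_actions: int = 4, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Fixed random projections stand in for learned visual encoders.
        self.enc_obs = rng.standard_normal((embed_dim, img_pixels)) / np.sqrt(img_pixels)
        self.enc_goal = rng.standard_normal((embed_dim, img_pixels)) / np.sqrt(img_pixels)
        self.head = rng.standard_normal((n_actions, 2 * embed_dim))

    def act(self, obs: np.ndarray, goal_view: np.ndarray) -> int:
        # Embed each view, condition the action head on both jointly.
        z_obs = self.enc_obs @ obs.reshape(-1)
        z_goal = self.enc_goal @ goal_view.reshape(-1)
        logits = self.head @ np.concatenate([z_obs, z_goal])
        return int(np.argmax(logits))

policy = CrossViewGoalPolicy(img_pixels=8 * 8)
obs = np.zeros((8, 8))    # agent's first-person view
goal = np.ones((8, 8))    # the same goal, seen from another camera pose
action = policy.act(obs, goal)
assert 0 <= action < policy.head.shape[0]
```

Because the goal lives in image space rather than in a task-specific label space, the same policy interface covers many tasks, which is what makes it usable as a unified multi-task goal space.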
Contribution/Results: Experiments demonstrate a 4× improvement in interaction success rate. The agent achieves strong zero-shot spatial reasoning and interaction generalization, not only in unseen virtual environments but also in real-world settings, significantly advancing research on transferable spatial intelligence for visuomotor agents.
📝 Abstract
While Reinforcement Learning (RL) has achieved remarkable success in language modeling, this success has not yet fully carried over to visuomotor agents. A primary challenge in RL models is their tendency to overfit specific tasks or environments, thereby hindering the acquisition of generalizable behaviors across diverse settings. This paper provides a preliminary answer to this challenge by demonstrating that RL-finetuned visuomotor agents in Minecraft can achieve zero-shot generalization to unseen worlds. Specifically, we explore RL's potential to enhance generalizable spatial reasoning and interaction capabilities in 3D worlds. To address challenges in multi-task RL representation, we analyze and establish cross-view goal specification as a unified multi-task goal space for visuomotor policies. Furthermore, to overcome the significant bottleneck of manual task design, we propose automated task synthesis within the highly customizable Minecraft environment for large-scale multi-task RL training, and we construct an efficient distributed RL framework to support this. Experimental results show RL significantly boosts interaction success rates by $4\times$ and enables zero-shot generalization of spatial reasoning across diverse environments, including real-world settings. Our findings underscore the immense potential of RL training in 3D simulated environments, especially those amenable to large-scale task generation, for significantly advancing visuomotor agents' spatial reasoning.
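The automated task synthesis the abstract describes replaces manual task design with procedural sampling: each task can be characterized by a world configuration, a target to interact with, and a goal-camera pose from which the cross-view goal image is rendered inside the simulator. The sketch below shows one plausible shape of such a generator; every field name and value range is a hypothetical illustration, not the paper's actual task schema.

```python
import random

def synthesize_task(seed: int) -> dict:
    """Hypothetical task-synthesis sketch: sample a world seed, a target
    entity/block, and a goal-camera pose. In a real pipeline, the
    simulator would then render the cross-view goal image for this spec."""
    rng = random.Random(seed)  # per-task RNG -> reproducible tasks
    targets = ["oak_log", "sheep", "iron_ore", "crafting_table"]
    return {
        "world_seed": rng.randrange(2**31),
        "target": rng.choice(targets),
        "goal_camera": {
            "yaw": rng.uniform(-180.0, 180.0),
            "pitch": rng.uniform(-45.0, 45.0),
            "distance": rng.uniform(2.0, 10.0),
        },
    }

# Generate a large, diverse batch of tasks for multi-task RL training;
# distributed actors could each consume a disjoint range of seeds.
tasks = [synthesize_task(s) for s in range(1000)]
```

Seeding each task independently keeps generation embarrassingly parallel, which is what lets a distributed RL framework scale task creation alongside rollout collection.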