🤖 AI Summary
This work addresses a limitation of existing text-guided pretraining paradigms: they emphasize high-level semantics but struggle to capture the low-level spatial and physical knowledge essential for embodied intelligence. To bridge this gap, we propose GEM, a generative-supervised embodied vision-language model that, for the first time, incorporates depth map generation as a generative supervision signal during VLM pretraining, jointly optimizing semantic understanding and physical interaction capabilities. We construct GEM-4M, a large-scale multitask embodied dataset of 4 million samples, and use it for end-to-end pretraining at scale. Experiments show that GEM achieves state-of-the-art performance across multiple embodied AI benchmarks, and that its deployed variant, GEM-VLA, significantly improves task success rates in both simulated and real-world environments.
📝 Abstract
Embodied Vision-Language Models (VLMs) have demonstrated impressive performance and generalization in robotics, particularly within Vision-Language-Action (VLA) frameworks. However, a significant gap remains between the high-level semantic focus of standard text-guided pre-training paradigms and the low-level spatial and physical knowledge critical for execution in embodied environments. In this paper, we introduce GEM, a Generative-supervised Embodied vision-language Model designed to bridge this divide. We propose integrating a depth map generation task directly into the VLM pre-training phase. By training this generative objective jointly with the main model, we observe substantial improvements in embodied intelligence, in both semantic understanding and physical operation capabilities. To support this paradigm, we curate and release GEM-4M, a large-scale dataset that mixes grounding, reasoning, and planning data paired with high-quality depth supervision. Extensive experiments demonstrate that GEM achieves state-of-the-art results across diverse embodied benchmarks. Furthermore, our deployed action model, GEM-VLA, exhibits substantially stronger task execution in both simulation environments and real-world evaluations. Code, models, and datasets are available at https://zhaorw02.github.io/GEM/
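
The abstract does not spell out how the depth generation objective is combined with the text objective, so the following is only a minimal sketch of what joint training could look like, not the authors' implementation. It assumes a next-token cross-entropy term for text, an L1 term over valid depth pixels, and a scalar weight; the function names, the L1 choice, and `depth_weight` are all hypothetical.

```python
import math


def text_cross_entropy(logits, target_ids):
    """Mean cross-entropy over token positions.

    logits: list of per-position score lists (one score per vocabulary entry).
    target_ids: list of ground-truth token indices, one per position.
    """
    total = 0.0
    for scores, target in zip(logits, target_ids):
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
    return total / len(target_ids)


def depth_l1(pred, target, valid):
    """Mean absolute error over pixels flagged as valid (assumed loss form)."""
    errors = [abs(p - t) for p, t, v in zip(pred, target, valid) if v]
    return sum(errors) / len(errors) if errors else 0.0


def joint_loss(logits, target_ids, depth_pred, depth_gt, depth_valid,
               depth_weight=0.5):
    """Weighted sum of the text loss and the depth generation loss.

    depth_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    l_text = text_cross_entropy(logits, target_ids)
    l_depth = depth_l1(depth_pred, depth_gt, depth_valid)
    return l_text + depth_weight * l_depth, l_text, l_depth
```

In this sketch both terms contribute to a single scalar, so one backward pass would update shared parameters for both semantic and depth supervision; the actual loss forms and weighting used by GEM may differ.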