D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high cost of collecting trajectory data on physical robots, this paper proposes the D2E framework, which pretrains embodied agents on large-scale vision-action interaction data from desktop gaming environments and transfers the learned representations to real-world robotic tasks. Methodologically, we establish an end-to-end open-source pipeline: (i) multimodal data compression using the OWA Toolkit; (ii) internet-scale pseudo-labeling via timestamped event-driven learning with the Generalist-IDM; and (iii) cross-domain transfer of visual-action representations to physical manipulation and navigation via VAPT. Our core contribution is the empirical validation that sensorimotor priors generalize between digital and embodied domains, establishing desktop pretraining as a novel paradigm for robot learning. Experiments demonstrate state-of-the-art performance on the LIBERO manipulation and CANVAS navigation benchmarks, with success rates of 96.6% and 83.3%, respectively, substantially outperforming existing transfer-learning approaches.
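The pipeline above hinges on unifying heterogeneous desktop interactions (keyboard, mouse, screen frames) into one timestamped event stream, from which (observation, next event) training pairs can be drawn. The sketch below illustrates that idea; the record fields and function names are illustrative assumptions, not the actual OWA Toolkit schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical sketch of a unified desktop interaction record, in the spirit
# of the OWA Toolkit's standardized format. Field names are assumptions.
@dataclass
class DesktopEvent:
    timestamp_ns: int                 # event time, enabling timestamped event-driven learning
    event_type: str                   # e.g. "key_press", "mouse_move", "frame"
    payload: dict = field(default_factory=dict)  # event-specific data (key code, cursor delta)
    frame_ref: Optional[str] = None   # pointer into the compressed video stream

def to_training_pair(prev_frame_ref: str, event: DesktopEvent) -> dict:
    """Pair an observation with the next timestamped event: the basic unit
    for next-event prediction in this sketch."""
    return {
        "observation": prev_frame_ref,
        "target": (event.timestamp_ns, event.event_type, event.payload),
    }

ev = DesktopEvent(timestamp_ns=1_000_000, event_type="key_press",
                  payload={"key": "W"}, frame_ref="frame_0042")
pair = to_training_pair("frame_0041", ev)
```

Predicting the timestamp alongside the event label is what lets a single model handle streams with very different event rates across games.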

📝 Abstract
Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive costs of physical trajectory collection. Desktop environments -- particularly gaming -- offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning. We present D2E (Desktop to Embodied AI), a framework that demonstrates desktop interactions can serve as an effective pretraining substrate for embodied AI tasks in robotics. Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains. Our framework comprises three components: (1) the OWA Toolkit, which unifies diverse desktop interactions into a standardized format with 152x compression; (2) the Generalist-IDM, which achieves strong zero-shot generalization across unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling; and (3) VAPT, which transfers desktop-pretrained representations to physical manipulation and navigation. Using 1.3K+ hours of data (259 hours of human demonstrations and 1K+ hours of pseudo-labeled gameplay), we achieve a 96.6% success rate on the LIBERO manipulation benchmark and 83.3% on the CANVAS navigation benchmark. This validates that sensorimotor primitives in digital interactions exhibit sufficient invariance to transfer meaningfully to physical embodied tasks, establishing desktop pretraining as a practical paradigm for robotics. We will make all our work public: the OWA Toolkit, the datasets of human-collected and pseudo-labeled data, and the VAPT-trained models are available at https://worv-ai.github.io/d2e/
Problem

Research questions and friction points this paper is trying to address.

Scaling embodied AI by leveraging desktop data for pretraining
Transferring desktop-pretrained models to physical robotics tasks
Overcoming costly physical data collection with digital interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Desktop interactions serve as pretraining substrate for robotics
OWA toolkit compresses diverse desktop data into standardized format
Generalist-IDM enables zero-shot generalization across unseen games
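The pseudo-labeling idea behind the Generalist-IDM can be sketched as follows: an inverse dynamics model infers the event that occurred between consecutive frames, so raw unlabeled gameplay video becomes (observation, action) pairs. This is how 259 hours of human demonstrations scale to 1K+ hours of training data. The function signature and the toy IDM below are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, List, Tuple

Frame = bytes
Event = Tuple[int, str]  # (timestamp, event label), per timestamped event prediction

def pseudo_label(frames: List[Frame],
                 idm: Callable[[Frame, Frame], Event]) -> List[Tuple[Frame, Event]]:
    """Label each frame with the event the IDM infers between it and the
    next frame, turning unlabeled video into (observation, action) pairs.
    `idm` stands in for a trained inverse dynamics model; its signature
    here is a simplifying assumption."""
    return [(frames[i], idm(frames[i], frames[i + 1]))
            for i in range(len(frames) - 1)]

# Toy IDM that "detects" an event from raw byte differences, for illustration only.
toy_idm = lambda a, b: (0, "key_press" if a != b else "no_op")
labeled = pseudo_label([b"f0", b"f1", b"f1"], toy_idm)
```

A real IDM would be a learned video model, but the data-flow is the same: it only needs to be trained once on labeled demonstrations, after which it can annotate arbitrarily large corpora of gameplay video.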