🤖 AI Summary
Current vision-language-action (VLA) models exhibit substantial performance gaps relative to human capabilities in embodied multimodal reasoning and open-world, long-horizon manipulation.
Method: We propose EO-1—the first unified pretraining architecture enabling interleaved, joint modeling of visual, textual, and action modalities—and introduce EO-Data1.5M, a large-scale embodied robotics dataset. Our approach innovatively integrates autoregressive decoding with flow-matching denoising to achieve end-to-end joint modeling of images, text, video, and action sequences.
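To make the combination concrete, here is a minimal sketch (not the authors' code) of how an autoregressive text loss and a flow-matching action loss can be computed in one training step over hidden states from a shared backbone. The module names, dimensions, and the velocity parameterization are illustrative assumptions, not EO-1's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedHead(nn.Module):
    """Illustrative heads for joint text (autoregressive) and action (flow-matching) losses."""

    def __init__(self, hidden_dim=512, vocab_size=32000, action_dim=7):
        super().__init__()
        self.lm_head = nn.Linear(hidden_dim, vocab_size)                 # next-token logits
        self.flow_head = nn.Linear(hidden_dim + action_dim, action_dim)  # velocity prediction

    def text_loss(self, hidden, target_tokens):
        # Standard next-token cross-entropy over text positions.
        logits = self.lm_head(hidden)                                    # (B, T, vocab)
        return F.cross_entropy(logits.flatten(0, 1), target_tokens.flatten())

    def action_loss(self, hidden, actions):
        # Flow matching: regress the velocity (actions - noise) at a random time t,
        # given the noisy interpolant x_t = (1 - t) * noise + t * actions.
        noise = torch.randn_like(actions)
        t = torch.rand(actions.shape[0], 1, 1)
        x_t = (1 - t) * noise + t * actions
        v_pred = self.flow_head(torch.cat([hidden, x_t], dim=-1))
        return F.mse_loss(v_pred, actions - noise)

# Toy usage: in practice the hidden states would come from the shared multimodal backbone.
head = UnifiedHead()
text_hidden, tokens = torch.randn(2, 10, 512), torch.randint(0, 32000, (2, 10))
action_hidden, actions = torch.randn(2, 16, 512), torch.randn(2, 16, 7)
loss = head.text_loss(text_hidden, tokens) + head.action_loss(action_hidden, actions)
loss.backward()
```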
Contribution/Results: EO-1 demonstrates strong generalization across diverse real-world robotic platforms, successfully executing complex dexterous manipulation tasks. It achieves near-human performance in multimodal understanding, cross-modal alignment, and real-time action generation. By unifying heterogeneous modalities within a single scalable framework, EO-1 establishes a foundational paradigm for general-purpose embodied intelligence.
📝 Abstract
The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general-purpose embodied intelligent systems. Recent vision-language-action (VLA) models, co-trained on large-scale robot and visual-text data, have demonstrated notable progress in general robot control. However, they still fail to achieve human-level flexibility in interleaved reasoning and interaction. In this work, we introduce EO-Robotics, which consists of the EO-1 model and the EO-Data1.5M dataset. EO-1 is a unified embodied foundation model that achieves superior performance in multimodal embodied reasoning and robot control through interleaved vision-text-action pre-training. The development of EO-1 rests on two key pillars: (i) a unified architecture that processes multimodal inputs (image, text, video, and action) uniformly, and (ii) a massive, high-quality multimodal embodied reasoning dataset, EO-Data1.5M, which contains over 1.5 million samples with an emphasis on interleaved vision-text-action comprehension. EO-1 is trained on EO-Data1.5M through the synergy of auto-regressive decoding and flow-matching denoising, enabling seamless robot action generation and multimodal embodied reasoning. Extensive experiments demonstrate the effectiveness of interleaved vision-text-action learning for open-world understanding and generalization, validated on a variety of long-horizon, dexterous manipulation tasks across multiple embodiments. This paper details the architecture of EO-1, the data construction strategy of EO-Data1.5M, and the training methodology, offering valuable insights for developing advanced embodied foundation models.
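For illustration of what "interleaved vision-text-action" data means in practice, the sketch below shows one plausible way such a sample could be represented as an ordered mix of modality segments. The class and field names, shapes, and example contents are assumptions for exposition only, not the actual EO-Data1.5M schema.

```python
from dataclasses import dataclass, field
from typing import List, Union
import numpy as np

@dataclass
class ImageObs:
    pixels: np.ndarray          # H x W x 3 camera frame

@dataclass
class TextSpan:
    text: str                   # instruction, reasoning step, or answer

@dataclass
class ActionChunk:
    actions: np.ndarray         # T x action_dim continuous robot commands

@dataclass
class InterleavedSample:
    # A single training example is an ordered sequence mixing modalities, e.g.
    # [image, instruction, reasoning text, action chunk, image, answer, ...].
    segments: List[Union[ImageObs, TextSpan, ActionChunk]] = field(default_factory=list)

sample = InterleavedSample(segments=[
    ImageObs(pixels=np.zeros((224, 224, 3), dtype=np.uint8)),
    TextSpan(text="Pick up the red mug and place it on the tray."),
    TextSpan(text="The mug is to the left of the tray; approach from above."),
    ActionChunk(actions=np.zeros((16, 7), dtype=np.float32)),
])
```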