EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control

📅 2025-08-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current vision-language-action (VLA) models exhibit substantial performance gaps relative to human capabilities in embodied multimodal reasoning and open-world, long-horizon manipulation. Method: We propose EO-1—the first unified pretraining architecture enabling interleaved, joint modeling of visual, textual, and action modalities—and introduce EO-Data1.5M, a large-scale embodied robotics dataset. Our approach innovatively integrates autoregressive decoding with flow-matching denoising to achieve end-to-end joint modeling of images, text, video, and action sequences. Contribution/Results: EO-1 demonstrates strong generalization across diverse real-world robotic platforms, successfully executing complex dexterous manipulation tasks. It achieves near-human performance in multimodal understanding, cross-modal alignment, and real-time action generation. By unifying heterogeneous modalities within a single scalable framework, EO-1 establishes a foundational paradigm for general-purpose embodied intelligence.

Technology Category

Application Category

📝 Abstract
The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general-purpose embodied intelligent systems. Recent vision-language-action (VLA) models, which are co-trained on large-scale robot and visual-text data, have demonstrated notable progress in general robot control. However, they still fail to achieve human-level flexibility in interleaved reasoning and interaction. In this work, introduce EO-Robotics, consists of EO-1 model and EO-Data1.5M dataset. EO-1 is a unified embodied foundation model that achieves superior performance in multimodal embodied reasoning and robot control through interleaved vision-text-action pre-training. The development of EO-1 is based on two key pillars: (i) a unified architecture that processes multimodal inputs indiscriminately (image, text, video, and action), and (ii) a massive, high-quality multimodal embodied reasoning dataset, EO-Data1.5M, which contains over 1.5 million samples with emphasis on interleaved vision-text-action comprehension. EO-1 is trained through synergies between auto-regressive decoding and flow matching denoising on EO-Data1.5M, enabling seamless robot action generation and multimodal embodied reasoning. Extensive experiments demonstrate the effectiveness of interleaved vision-text-action learning for open-world understanding and generalization, validated through a variety of long-horizon, dexterous manipulation tasks across multiple embodiments. This paper details the architecture of EO-1, the data construction strategy of EO-Data1.5M, and the training methodology, offering valuable insights for developing advanced embodied foundation models.
Problem

Research questions and friction points this paper is trying to address.

Achieving human-level flexibility in interleaved reasoning and interaction
Developing unified embodied foundation model for multimodal reasoning
Enabling seamless robot action generation and embodied understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interleaved vision-text-action pre-training methodology
Unified architecture processing multimodal inputs indiscriminately
Auto-regressive decoding with flow matching denoising training
🔎 Similar Papers
No similar papers found.
Delin Qu
Delin Qu
PhD Candidate of Fudan University
Embodied AI3D VisionMultimodal Generation
H
Haoming Song
Shanghai AI Laboratory; Fudan University; AgiBot; Northwestern Polytechnical University
Qizhi Chen
Qizhi Chen
PhD Candidate of Zhejiang University
Multimodal ReasoningEmbodied AI3D Vision
Z
Zhaoqing Chen
Shanghai AI Laboratory; Fudan University; AgiBot; Northwestern Polytechnical University
Xianqiang Gao
Xianqiang Gao
PhD Student of University of Science and Technology of China, Shanghai AI Lab
Modi Shi
Modi Shi
Beihang University
embodied ai
G
Guanghui Ren
Shanghai AI Laboratory; Fudan University; AgiBot; Northwestern Polytechnical University
Maoqing Yao
Maoqing Yao
Google
B
Bin Zhao
Shanghai AI Laboratory; Fudan University; AgiBot; Northwestern Polytechnical University
D
Dong Wang
Shanghai AI Laboratory; Fudan University; AgiBot; Northwestern Polytechnical University