🤖 AI Summary
Traditional embodied intelligence systems handle perception, reasoning, imagination, and action in separate, isolated modules, which prevents them from being optimized together. This work proposes the first unified paradigm for embodied foundation models. It is centered on a single vision-language model (VLM) that autoregressively generates task-oriented chains of thought in a single forward pass, while a Unified Future Generator (UFG) conditioned on the VLM's output jointly predicts future video frames and actions. The model achieves, for the first time, end-to-end joint training and shared representations across all four core capabilities, replacing the fragmented multi-module paradigm. It scores 64.7 across eight VLM benchmarks, ranks first on WorldArena with 66.03, and reaches 93.5 on RoboTwin, second among the compared action methods, indicating that unification does not come at the cost of specialist-level performance.
📝 Abstract
We present Pelican-Unified 1.0, the first embodied foundation model trained according to the principle of unification. Pelican-Unified 1.0 uses a single VLM as a unified understanding module, mapping scenes, instructions, visual contexts, and action histories into a shared semantic space. The same VLM also serves as a unified reasoning module, autoregressively producing task-, action-, and future-oriented chains of thought in a single forward pass and projecting the final hidden state into a dense latent variable. A Unified Future Generator (UFG) then conditions on this latent variable and jointly generates future videos and future actions through two modality-specific output heads within the same denoising process. The language, video, and action losses are all backpropagated into the shared representation, enabling the model to jointly optimize understanding, reasoning, imagination, and action during training, rather than training three isolated expert systems.
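The data flow described above can be illustrated with a toy sketch. It is not the authors' implementation: the class `ToyUnifiedModel`, the dimensions, the linear layers standing in for the VLM projection and the UFG, and the loss weights `w_video` and `w_action` are all hypothetical, chosen only to show one shared trunk feeding two modality-specific heads within a single denoising step, and a single summed loss over language, video, and action terms.

```python
import random

# Every name, size, and weight here is an illustrative assumption,
# not a detail taken from Pelican-Unified 1.0.
HIDDEN_DIM, LATENT_DIM, VIDEO_DIM, ACTION_DIM = 4, 4, 6, 3


def init(rows, cols, rng):
    """Create a rows x cols weight matrix with small random values."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def linear(weights, x):
    """Apply a dense layer stored as a list of rows to vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


class ToyUnifiedModel:
    def __init__(self, seed=0):
        rng = random.Random(seed)
        # Stand-in for projecting the VLM's final hidden state to a latent.
        self.to_latent = init(LATENT_DIM, HIDDEN_DIM, rng)
        # Shared trunk: sees the latent plus both noisy modalities.
        self.trunk = init(LATENT_DIM, LATENT_DIM + VIDEO_DIM + ACTION_DIM, rng)
        # Two modality-specific output heads on top of the shared trunk.
        self.video_head = init(VIDEO_DIM, LATENT_DIM, rng)
        self.action_head = init(ACTION_DIM, LATENT_DIM, rng)

    def denoise_step(self, hidden, noisy_video, noisy_action):
        """One shared denoising step that yields both predictions."""
        latent = linear(self.to_latent, hidden)
        shared = linear(self.trunk, latent + noisy_video + noisy_action)
        return linear(self.video_head, shared), linear(self.action_head, shared)


def joint_loss(language_loss, video_pred, video_target,
               action_pred, action_target, w_video=1.0, w_action=1.0):
    """Sum the three loss terms so that one objective covers all of them."""
    return (language_loss
            + w_video * mse(video_pred, video_target)
            + w_action * mse(action_pred, action_target))


model = ToyUnifiedModel()
hidden = [0.1, -0.2, 0.3, 0.0]  # placeholder for a VLM hidden state
video_pred, action_pred = model.denoise_step(
    hidden, [0.0] * VIDEO_DIM, [0.0] * ACTION_DIM)
loss = joint_loss(0.5, video_pred, [0.0] * VIDEO_DIM,
                  action_pred, [0.0] * ACTION_DIM)
```

The sketch only computes a forward pass and a scalar loss. In an automatic-differentiation framework, differentiating that single scalar would send gradients from all three terms into the shared parameters, which is the property the abstract emphasizes.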
Experiments demonstrate that unification does not imply compromise. With a single checkpoint, Pelican-Unified 1.0 achieves strong performance across all three capabilities: 64.7 on eight VLM benchmarks, the best among comparable-scale models; 66.03 on WorldArena, ranking first; and 93.5 on RoboTwin, the second-best average among compared action methods. These results show that the unified paradigm succeeds in preserving specialist strength while bringing understanding, reasoning, imagination, and action into one model.