Bridging VLMs and Embodied Intelligence with Deliberate Practice Policy Optimization

📅 2025-11-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Embodied intelligence faces two coupled challenges: scarce, expensive real-world data and low algorithmic efficiency. To address these, we propose Deliberate Practice Policy Optimization (DPPO), the first metacognition-driven "Metaloop" training framework. DPPO dynamically alternates between supervised fine-tuning and reinforcement learning, automatically identifying capability gaps and reallocating computational resources toward underperforming skills, which enables efficient learning from sparse embodied data. Building on DPPO, we integrate vision-language modeling with embodied decision-making to introduce Pelican-VL 1.0, a unified multimodal foundation model. Experiments show that Pelican-VL 1.0 achieves a 20.3% absolute improvement over its base model on standard embodied tasks and outperforms open-source models at the 100B-parameter scale by 10.6%. All code and models are publicly released, establishing a new paradigm for efficient, scalable development of embodied agents.

📝 Abstract
Developing a universal and versatile embodied intelligence system presents two primary challenges: the critical embodied data bottleneck, where real-world data is scarce and expensive, and the algorithmic inefficiency of existing methods, which are resource-prohibitive. To address these limitations, we introduce Deliberate Practice Policy Optimization (DPPO), a metacognitive "Metaloop" training framework that dynamically alternates between supervised fine-tuning (competence expansion) and reinforcement learning (skill refinement). This enables automatic weakness identification and targeted resource allocation, specifically designed to maximize learning efficiency from sparse, finite data. Theoretically, DPPO can be formalized as a unified preference-learning framework. Empirically, training a vision-language embodied model with DPPO, referred to as Pelican-VL 1.0, yields a 20.3% performance improvement over the base model and surpasses open-source models at the 100B-parameter scale by 10.6%. We are open-sourcing both the models and code, providing the first systematic framework that alleviates the data and resource bottleneck and enables the community to build versatile embodied agents efficiently.
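The Metaloop described in the abstract can be sketched as a toy control loop: score each skill, flag skills below a threshold (the metacognitive "weakness identification" step), then concentrate an SFT-style update followed by an RL-style refinement on those skills. Everything below (the skill names, the threshold, the additive proficiency updates) is an illustrative assumption for intuition only, not the paper's actual algorithm or training code.

```python
def metaloop(scores, rounds=5, weak_threshold=0.7, gain=0.1):
    """Toy Metaloop sketch. `scores` maps skill name -> proficiency in [0, 1]."""
    for _ in range(rounds):
        # Metacognitive step: identify capability gaps (underperforming skills).
        weak = [s for s, v in scores.items() if v < weak_threshold]
        # Reallocate "compute" toward weak skills; if none, train broadly.
        targets = weak or list(scores)
        for s in targets:
            # SFT-like phase: competence expansion (larger update).
            scores[s] = min(1.0, scores[s] + gain)
            # RL-like phase: skill refinement (smaller update).
            scores[s] = min(1.0, scores[s] + gain / 2)
    return scores

final = metaloop({"spatial_reasoning": 0.4,
                  "manipulation": 0.8,
                  "affordance_grounding": 0.5})
print(final)
```

The point of the sketch is the scheduling behavior, not the updates themselves: early rounds spend all effort on the weakest skills, and broad training resumes only once every skill clears the threshold.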
Problem

Research questions and friction points this paper is trying to address.

Overcoming the scarcity of real-world data for embodied intelligence systems
Reducing the algorithmic inefficiency of resource-intensive training methods
Developing a unified framework for weakness identification and targeted learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Metacognitive Metaloop alternates between SFT and RL
Automatic weakness identification maximizes learning from sparse data
Unified preference-learning framework formalizes DPPO theoretically
Authors
Yi Zhang (X-Humanoid)
Che Liu (Imperial College London)
Xiancong Ren (X-Humanoid)
Hanchu Ni (Peking University)
Yingji Zhang (University of Manchester)
Shuai Zhang (Westlake University)
Zeyuan Ding (X-Humanoid)
Jiayu Hu (X-Humanoid)
Haozhe Shan (Fudan University)
Junbo Qi (Waseda University)
Yan Bai (University of Rochester)
Dengjie Li (X-Humanoid)
Jiachen Luo (Queen Mary University of London)
Yidong Wang (Peking University)
Yong Dai (X-Humanoid)
Zenglin Xu (Fudan University)
Bin Shen (Celonis AI)
Qifan Wang (Meta AI)
Jian Tang (X-Humanoid)
Xiaozhu Ju (X-Humanoid)