WholeBodyVLA: Towards Unified Latent VLA for Whole-Body Loco-Manipulation Control

📅 2025-12-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Humanoid robots face significant challenges in large-scale, pose-adaptive loco-manipulation tasks, including low locomotion-manipulation coordination accuracy and a limited operational workspace. To address these issues, this paper proposes a unified latent-space vision-language-action (VLA) framework. It introduces a novel VLA learning paradigm grounded exclusively in action-free observational videos, eliminating reliance on costly, expert-annotated action labels, and further designs a loco-manipulation-oriented (LMO) reinforcement learning policy that balances precision, stability, and generalization. The framework integrates three core components: latent-space representation learning, unsupervised video-driven action modeling, and task-specific RL optimization. Evaluated on the AgiBot X2 platform, the method achieves a 21.3% performance improvement over baseline approaches, demonstrating substantially enhanced cross-task generalization and scalability for large-space manipulation tasks.

📝 Abstract
Humanoid robots require precise locomotion and dexterous manipulation to perform challenging loco-manipulation tasks. Yet existing approaches, whether modular or end-to-end, are deficient in manipulation-aware locomotion. This confines the robot to a limited workspace and prevents it from performing large-space loco-manipulation. We attribute this to: (1) the challenge of acquiring loco-manipulation knowledge, owing to the scarcity of humanoid teleoperation data, and (2) the difficulty of faithfully and reliably executing locomotion commands, stemming from the limited precision and stability of existing RL controllers. To acquire richer loco-manipulation knowledge, we propose a unified latent learning framework that enables a Vision-Language-Action (VLA) system to learn from low-cost, action-free egocentric videos. Moreover, an efficient human data collection pipeline is devised to augment the dataset and scale these benefits. To execute the desired locomotion commands more precisely, we present a loco-manipulation-oriented (LMO) RL policy tailored for accurate and stable core loco-manipulation movements such as advancing, turning, and squatting. Building on these components, we introduce WholeBodyVLA, a unified framework for humanoid loco-manipulation. To the best of our knowledge, WholeBodyVLA is one of the first of its kind to enable large-space humanoid loco-manipulation. It is verified via comprehensive experiments on the AgiBot X2 humanoid, outperforming the prior baseline by 21.3%, and demonstrates strong generalization and high extensibility across a broad range of tasks.
Problem

Research questions and friction points this paper is trying to address.

Humanoid robots lack manipulation-aware locomotion for large-space tasks.
Scarcity of teleoperation data hinders loco-manipulation knowledge acquisition.
Existing RL controllers lack precision for stable locomotion command execution.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified latent framework learns from action-free videos.
Efficient human data pipeline augments dataset for scaling.
Loco-manipulation-oriented RL policy ensures precise, stable movements.
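The summary's "unsupervised video-driven action modeling" component learns actions from action-free video, but the page gives no implementation details. Below is a minimal, hypothetical numpy sketch of one common way such latent actions are extracted: encode a pair of consecutive frames, snap the result to a discrete codebook entry (VQ-style nearest-neighbor lookup), and decode the next frame from the current frame plus that latent action. All dimensions, weights, and function names here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; the paper does not specify these).
FRAME_DIM = 64      # flattened frame-feature size
LATENT_DIM = 8      # continuous latent-action size
CODEBOOK_SIZE = 16  # number of discrete latent actions

# Random projections stand in for trained encoder/decoder weights.
W_enc = rng.normal(size=(2 * FRAME_DIM, LATENT_DIM)) / np.sqrt(2 * FRAME_DIM)
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))
W_dec = rng.normal(size=(FRAME_DIM + LATENT_DIM, FRAME_DIM))

def encode_latent_action(frame_t, frame_t1):
    """Map two consecutive frames to a discrete latent-action index.

    No action labels are needed: the 'action' is whatever best explains
    the change between the two frames."""
    z = np.concatenate([frame_t, frame_t1]) @ W_enc
    # VQ step: nearest codebook entry becomes the discrete latent action.
    return int(np.argmin(np.linalg.norm(codebook - z, axis=1)))

def predict_next_frame(frame_t, action_idx):
    """World-model-style decoder: current frame + latent action -> next frame."""
    z_q = codebook[action_idx]
    return np.concatenate([frame_t, z_q]) @ W_dec

# Usage on random stand-in frame features.
frame_t = rng.normal(size=FRAME_DIM)
frame_t1 = rng.normal(size=FRAME_DIM)
action = encode_latent_action(frame_t, frame_t1)
pred = predict_next_frame(frame_t, action)
```

In a trained system the encoder, codebook, and decoder would be optimized jointly so that the decoder's next-frame prediction error forces the latent codes to capture action-like information; a downstream policy can then be conditioned on these codes without ever seeing expert action labels.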
👥 Authors
Haoran Jiang (Fudan University)
Jin Chen (Fudan University)
Qingwen Bu (HKU | OpenDriveLab)
Li Chen (OpenDriveLab at The University of Hong Kong)
Modi Shi (Beihang University)
Yanjie Zhang (AgiBot Inc.)
Delong Li (AgiBot Inc.)
Chuanzhe Suo (AgiBot Inc.)
Chuang Wang (AgiBot Inc.)
Zhihui Peng (AgiBot Inc.)
Hongyang Li (OpenDriveLab at The University of Hong Kong)