Universal Pose Pretraining for Generalizable Vision-Language-Action Policies

📅 2026-02-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing vision–language–action (VLA) models, which suffer from feature collapse and inefficient training due to the tight coupling of high-level perception with sparse, embodied action supervision, and which struggle to capture the fine-grained 3D state changes critical for decision-making. To overcome these challenges, we propose Pose-VLA, a decoupled pretraining framework that first learns general spatial priors in a unified camera-centric 3D space and then efficiently aligns robotic action trajectories using discrete pose tokens. By integrating multi-source 3D data with geometrically grounded trajectory supervision, our approach enables effective joint learning across vision, language, and action. Evaluated on RoboTwin 2.0 and LIBERO, Pose-VLA achieves average success rates of 79.5% and 96.0%, respectively, and demonstrates strong generalization across diverse objects with only 100 real-world demonstrations per task, significantly improving data efficiency and cross-task transferability.

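The summary above presents discrete pose tokens as the universal representation that bridges 3D spatial grounding and robot action trajectories. As a minimal, non-authoritative sketch of that idea, the snippet below discretizes a camera-centric 6-DoF pose into token ids by uniform binning; the bin count, value ranges, and Euler-angle rotation parameterization are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the paper's implementation): discretize a camera-centric
# 6-DoF pose into token ids by uniform binning. Bin count, value ranges, and
# the Euler-angle parameterization are illustrative assumptions.
import numpy as np

NUM_BINS = 256  # assumed per-dimension token vocabulary size

def pose_to_tokens(position, euler_angles,
                   pos_range=(-1.0, 1.0), rot_range=(-np.pi, np.pi)):
    """Map a continuous pose (xyz + roll/pitch/yaw) to six discrete token ids."""
    pose = np.concatenate([position, euler_angles])                  # shape (6,)
    lows = np.array([pos_range[0]] * 3 + [rot_range[0]] * 3)
    highs = np.array([pos_range[1]] * 3 + [rot_range[1]] * 3)
    normalized = (np.clip(pose, lows, highs) - lows) / (highs - lows)
    return np.round(normalized * (NUM_BINS - 1)).astype(int)         # ids in [0, 255]

def tokens_to_pose(tokens, pos_range=(-1.0, 1.0), rot_range=(-np.pi, np.pi)):
    """Approximate inverse: recover bin-center pose values from token ids."""
    lows = np.array([pos_range[0]] * 3 + [rot_range[0]] * 3)
    highs = np.array([pos_range[1]] * 3 + [rot_range[1]] * 3)
    return lows + tokens / (NUM_BINS - 1) * (highs - lows)

# Example: a point 0.3 m in front of the camera, rotated 45 degrees about z.
tokens = pose_to_tokens(np.array([0.0, 0.0, 0.3]),
                        np.array([0.0, 0.0, np.pi / 4]))
print(tokens)                  # six integer token ids
print(tokens_to_pose(tokens))  # values close to the original pose
```

Because the tokens are embodiment-agnostic quantities in camera space, the same vocabulary can in principle be supervised by generic 3D data and by robot demonstrations alike, which is the property the summary attributes to the decoupled pretraining.
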
📝 Abstract
Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency because they entangle high-level perception with sparse, embodiment-specific action supervision. Since these models typically rely on VLM backbones optimized for Visual Question Answering (VQA), they excel at semantic identification but often overlook the subtle 3D state variations that dictate distinct action patterns. To resolve these misalignments, we propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors in a unified camera-centric space, and a post-training phase for efficient embodiment alignment within a robot-specific action space. By introducing discrete pose tokens as a universal representation, Pose-VLA seamlessly integrates spatial grounding from diverse 3D datasets with geometry-level trajectories from robotic demonstrations. Our framework follows a two-stage pre-training pipeline, first establishing fundamental spatial grounding via pose prediction and then performing motion alignment through trajectory supervision. Extensive evaluations demonstrate that Pose-VLA achieves state-of-the-art results on RoboTwin 2.0 with a 79.5% average success rate and competitive performance on LIBERO at 96.0%. Real-world experiments further showcase robust generalization across diverse objects using only 100 demonstrations per task, validating the efficiency of our pre-training paradigm.
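As a rough sketch of how the decoupled schedule described in the abstract could look in code, the snippet below stages a cross-entropy objective over discrete pose tokens: spatial grounding on multi-source 3D data, then motion alignment via trajectory supervision, then embodiment-specific post-training. The TinyPoseVLA module, feature sizes, learning rates, and dummy data are placeholders introduced here for illustration and are not taken from the paper.

```python
# Hedged sketch of the staged, decoupled training schedule: the same pose-token
# prediction objective is reused across stages while the data source changes.
# Model architecture, dimensions, and hyperparameters are assumptions.
import torch
import torch.nn as nn

NUM_POSE_TOKENS = 256   # assumed vocabulary size per pose dimension
POSE_DIMS = 6           # xyz + three rotation components (assumption)

class TinyPoseVLA(nn.Module):
    """Stand-in for the VLM backbone plus a pose-token head (illustrative only)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder = nn.Linear(512, feat_dim)   # fused vision+language features
        self.pose_head = nn.Linear(feat_dim, POSE_DIMS * NUM_POSE_TOKENS)

    def forward(self, fused_features):
        logits = self.pose_head(torch.relu(self.encoder(fused_features)))
        return logits.view(-1, POSE_DIMS, NUM_POSE_TOKENS)

def train_stage(model, batches, lr):
    """One pass of cross-entropy training over discrete pose tokens."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for features, target_tokens in batches:
        logits = model(features)                                   # (B, 6, 256)
        loss = loss_fn(logits.reshape(-1, NUM_POSE_TOKENS),
                       target_tokens.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()

model = TinyPoseVLA()
dummy = [(torch.randn(4, 512),
          torch.randint(0, NUM_POSE_TOKENS, (4, POSE_DIMS)))]
train_stage(model, dummy, lr=1e-4)   # stage 1: spatial grounding on 3D data
train_stage(model, dummy, lr=1e-4)   # stage 2: motion alignment on trajectories
train_stage(model, dummy, lr=1e-5)   # post-training: embodiment-specific alignment
```

The point of the sketch is only the staging: the pose-token prediction objective stays fixed while the supervision source shifts from generic 3D data to robot trajectories to embodiment-specific demonstrations.
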
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
feature collapse
3D spatial priors
embodiment-specific action
training efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pose-VLA
decoupled pretraining
3D spatial priors
discrete pose tokens
Vision-Language-Action