🤖 AI Summary
Current Vision-Language-Action (VLA) models tightly couple planning paradigms, representation design, and network architecture, making performance gains hard to attribute. To address this, we propose VLA-OS, a unified VLA architecture family that, for the first time, orthogonally decouples planning paradigms, network architectures, and training data. Our approach employs modular planning heads, cross-modal aligned representation learning, and hierarchical task-action joint training, enabling seamless adaptation across 2D/3D inputs, simulation and real-world environments, rigid and deformable objects, and gripper and dexterous-hand settings. Experiments show that visually grounded planning representations consistently outperform language-based ones. Among the paradigms evaluated, Hierarchical-VLA is the overall best: it improves average task success rate by 12.7% on long-horizon manipulation while markedly improving transferability, continual learning, and cross-environment generalization.
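The decoupling idea above can be sketched in code. The following is a minimal, hypothetical illustration (not the paper's actual implementation): a shared vision-language backbone feeds swappable planning heads, so the planning paradigm and representation (language-based vs. visually grounded) can vary independently of the backbone and action decoder. All class and function names here are illustrative placeholders.

```python
# Hypothetical sketch: shared backbone + pluggable planning heads.
# Placeholder computations stand in for real neural modules.
from dataclasses import dataclass
from typing import Callable, List, Union

@dataclass
class Observation:
    image_tokens: List[float]   # stand-in for visual features
    instruction: str            # language command

def backbone(obs: Observation) -> List[float]:
    # Shared vision-language encoder (placeholder: passes features through).
    return obs.image_tokens

# Planning heads: each maps shared features to an intermediate plan
# representation, either language-based or visually grounded.
def language_plan_head(feats: List[float]) -> str:
    return "subgoal: grasp object"            # language plan representation

def visual_plan_head(feats: List[float]) -> List[float]:
    return [f * 0.5 for f in feats]           # visually grounded plan (e.g. goal features)

def action_decoder(plan: Union[str, List[float]]) -> List[float]:
    # Low-level policy conditioned on the plan (placeholder action).
    return [0.0, 0.0, 1.0]

def run_vla(obs: Observation, plan_head: Callable) -> List[float]:
    feats = backbone(obs)
    plan = plan_head(feats)
    return action_decoder(plan)

obs = Observation(image_tokens=[1.0, 2.0], instruction="pick up the cube")
a1 = run_vla(obs, language_plan_head)   # same backbone, language planning
a2 = run_vla(obs, visual_plan_head)     # same backbone, visual planning
```

Because only the planning head changes between the two calls, any difference in downstream performance can be attributed to the planning paradigm rather than the architecture, which is the controlled comparison the paper's design enables.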
📝 Abstract
Recent studies on Vision-Language-Action (VLA) models have shifted from the end-to-end action-generation paradigm toward a pipeline of task planning followed by action generation, demonstrating improved performance on various complex, long-horizon manipulation tasks. However, existing approaches vary significantly in network architecture, planning paradigm, representation, and training data source, making it difficult for researchers to identify the precise origins of performance gains and the components that most need improvement. To systematically investigate the impact of different planning paradigms and representations in isolation from network architecture and training data, we introduce VLA-OS, a unified VLA architecture series that supports various task planning paradigms, and design a comprehensive suite of controlled experiments across diverse object categories (rigid and deformable), visual modalities (2D and 3D), environments (simulation and real-world), and end-effectors (grippers and dexterous hands). Our results demonstrate that: 1) visually grounded planning representations are generally better than language-based planning representations; and 2) the Hierarchical-VLA paradigm generally achieves performance superior or comparable to other paradigms in task performance, pretraining, generalization, scalability, and continual learning, albeit at the cost of slower training and inference.