HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

📅 2025-03-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current vision-language-action (VLA) models face two key bottlenecks: autoregressive paradigms disrupt action continuity, while pure diffusion approaches rely on static multimodal features, which degrades reasoning capability. To address both, we propose the first unified VLA framework that embeds conditional diffusion modeling directly into an autoregressive language-model architecture, enabling joint optimization of temporal coherence and physical plausibility in action generation. We further design an adaptive dual-strategy fusion mechanism that dynamically combines autoregressive prediction and diffusion-based refinement during both training and inference. Through multimodal feature alignment and end-to-end co-training, the model achieves state-of-the-art performance on simulated and real-world single- and dual-arm robotic tasks, and it generalizes to unseen robot configurations with stable dexterous manipulation.

📝 Abstract
Recent advancements in vision-language models (VLMs) for common-sense reasoning have led to the development of vision-language-action (VLA) models, enabling robots to perform generalized manipulation. Although existing autoregressive VLA methods leverage large-scale pretrained knowledge, they disrupt the continuity of actions. Meanwhile, some VLA methods incorporate an additional diffusion head to predict continuous actions, relying solely on VLM-extracted features, which limits their reasoning capabilities. In this paper, we introduce HybridVLA, a unified framework that seamlessly integrates the strengths of both autoregressive and diffusion policies within a single large language model, rather than simply connecting them. To bridge the generation gap, a collaborative training recipe is proposed that injects the diffusion modeling directly into the next-token prediction. With this recipe, we find that these two forms of action prediction not only reinforce each other but also exhibit varying performance across different tasks. Therefore, we design a collaborative action ensemble mechanism that adaptively fuses these two predictions, leading to more robust control. In experiments, HybridVLA outperforms previous state-of-the-art VLA methods across various simulation and real-world tasks, including both single-arm and dual-arm robots, while demonstrating stable manipulation in previously unseen configurations.
Problem

Research questions and friction points this paper is trying to address.

Autoregressive VLA methods disrupt the continuity of predicted actions.
Diffusion-head VLA methods rely solely on VLM-extracted features, limiting reasoning capability.
Stable manipulation in previously unseen robot configurations remains difficult.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework combining autoregressive and diffusion policies
Collaborative training integrates diffusion into next-token prediction
Adaptive ensemble mechanism for robust action prediction
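The adaptive ensemble described above fuses an autoregressive action prediction with a diffusion-refined one. A minimal sketch of one plausible fusion rule follows; the function name, the confidence-threshold rule, and the equal averaging weight are all illustrative assumptions, not the paper's actual recipe:

```python
def collaborative_ensemble(ar_action, diff_action, ar_confidence, tau=0.9):
    """Hypothetical fusion of two continuous action vectors.

    When the autoregressive head is confident (ar_confidence >= tau),
    average its action element-wise with the diffusion head's denoised
    action; otherwise fall back to the diffusion prediction alone.
    """
    if ar_confidence >= tau:
        return [0.5 * (a + d) for a, d in zip(ar_action, diff_action)]
    return list(diff_action)
```

For example, with a confident AR head the two predictions are averaged; with a low-confidence AR head the diffusion output is used unchanged. The actual mechanism in the paper adapts per task, which a fixed threshold like this does not capture.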
👥 Authors
Jiaming Liu — State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; Beijing Academy of Artificial Intelligence (BAAI)
Hao Chen — CUHK
Pengju An — Peking University
Zhuoyang Liu — Peking University
Renrui Zhang — Seed ByteDance & MMLab & PKU
Chenyang Gu — Peking University
Xiaoqi Li — State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; Beijing Academy of Artificial Intelligence (BAAI)
Ziyu Guo — The Chinese University of Hong Kong
Sixiang Chen — State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; Beijing Academy of Artificial Intelligence (BAAI)
Mengzhen Liu — State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; Beijing Academy of Artificial Intelligence (BAAI)
Chengkai Hou — Peking University
Mengdi Zhao — Beijing Academy of Artificial Intelligence (BAAI)
KC alex Zhou — State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; Beijing Academy of Artificial Intelligence (BAAI)
Pheng-Ann Heng — CUHK
Shanghang Zhang — Peking University