🤖 AI Summary
This work addresses the limited action-generation capability of vision-language models (VLMs) in robotic manipulation tasks. To this end, it introduces diffusion-based vision-language models (d-VLMs) into robot policy learning for the first time, proposing LLaDA-VLA, the first unified vision-language-diffusion-action model grounded in d-VLMs. Methodologically, it contributes a localized special-token classification mechanism and a hierarchical structured decoding strategy, explicitly modeling temporal dependencies among actions while reducing the difficulty of adapting d-VLM latent spaces to robotic action spaces. Evaluated on simulated and real-world robotic grasping and placing tasks, LLaDA-VLA substantially outperforms existing state-of-the-art vision-language-action models. These results empirically validate the effectiveness and generalization potential of the diffusion paradigm for embodied-intelligence policy learning.
📄 Abstract
The rapid progress of auto-regressive vision-language models (VLMs) has inspired growing interest in vision-language-action models (VLAs) for robotic manipulation. Recently, masked diffusion models, a paradigm distinct from autoregressive models, have begun to demonstrate competitive performance in text generation and multimodal applications, leading to the development of a series of diffusion-based VLMs (d-VLMs). However, leveraging such models for robot policy learning remains largely unexplored. In this work, we present LLaDA-VLA, the first Vision-Language-Diffusion-Action model built upon pretrained d-VLMs for robotic manipulation. To effectively adapt d-VLMs to the robotic domain, we introduce two key designs: (1) a localized special-token classification strategy that replaces full-vocabulary classification with classification over special action tokens, reducing adaptation difficulty; (2) a hierarchical action-structured decoding strategy that decodes action sequences hierarchically, accounting for dependencies within and across actions. Extensive experiments demonstrate that LLaDA-VLA significantly outperforms state-of-the-art VLAs both in simulation and on real-world robots.
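The localized special-token classification idea in design (1) can be illustrated with a minimal sketch: instead of taking an argmax over the model's entire vocabulary, the prediction is restricted to a small reserved subset of action token ids. The vocabulary size, number of action tokens, and their placement at the end of the vocabulary below are all hypothetical assumptions for illustration, not details taken from the paper.

```python
import numpy as np

# Hypothetical vocabulary layout (illustrative only): the last K ids
# of the vocabulary are reserved as discretized action tokens.
VOCAB_SIZE = 32000
NUM_ACTION_TOKENS = 256
ACTION_TOKEN_IDS = np.arange(VOCAB_SIZE - NUM_ACTION_TOKENS, VOCAB_SIZE)

def localized_special_token_classification(logits: np.ndarray) -> np.ndarray:
    """Classify over the action-token subset only, not the full vocabulary.

    logits: (seq_len, VOCAB_SIZE) array of per-position scores.
    Returns one chosen token id per position, guaranteed to lie in
    ACTION_TOKEN_IDS regardless of which full-vocabulary id scores highest.
    """
    action_logits = logits[:, ACTION_TOKEN_IDS]   # (seq_len, K) subset scores
    local_idx = action_logits.argmax(axis=-1)     # index within the subset
    return ACTION_TOKEN_IDS[local_idx]            # map back to vocabulary ids

# Toy check: with random logits, predictions always land in the action range.
rng = np.random.default_rng(0)
preds = localized_special_token_classification(rng.normal(size=(8, VOCAB_SIZE)))
assert ((preds >= VOCAB_SIZE - NUM_ACTION_TOKENS) & (preds < VOCAB_SIZE)).all()
```

Restricting the classification head this way shrinks the output space the pretrained d-VLM must be adapted to, which is the adaptation-difficulty reduction the abstract refers to.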