Beyond Human Demonstrations: Diffusion-Based Reinforcement Learning to Generate Data for VLA Training

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-Language-Action (VLA) models rely heavily on costly human demonstration data, limiting their scalability. To address this, we propose Diffusion-RL, a framework that integrates diffusion modeling with reinforcement learning. Leveraging the implicit regularization inherent in diffusion processes, our method autonomously generates high-quality, low-variance, temporally smooth, and semantically consistent training trajectories, effectively mitigating exploration challenges in sparse-reward and long-horizon tasks. Crucially, Diffusion-RL operates without human demonstrations: it iteratively refines policies via denoising-based optimization, enabling self-supervised behavioral acquisition while preserving action diversity and enhancing structural coherence. Evaluated on the LIBERO benchmark, Diffusion-RL achieves an average success rate of 81.9%, outperforming supervised human-demonstration baselines by 5.3%. This advances VLA models toward self-supervised, low-cost, and generalizable robotic learning.

📝 Abstract
Vision-language-action (VLA) models have shown strong generalization across tasks and embodiments; however, their reliance on large-scale human demonstrations limits their scalability owing to the cost and effort of manual data collection. Reinforcement learning (RL) offers a potential alternative to generate demonstrations autonomously, yet conventional RL algorithms often struggle on long-horizon manipulation tasks with sparse rewards. In this paper, we propose a modified diffusion policy optimization algorithm to generate high-quality and low-variance trajectories, which contributes to a diffusion RL-powered VLA training pipeline. Our algorithm benefits from not only the high expressiveness of diffusion models to explore complex and diverse behaviors but also the implicit regularization of the iterative denoising process to yield smooth and consistent demonstrations. We evaluate our approach on the LIBERO benchmark, which includes 130 long-horizon manipulation tasks, and show that the generated trajectories are smoother and more consistent than both human demonstrations and those from standard Gaussian RL policies. Further, training a VLA model exclusively on the diffusion RL-generated data achieves an average success rate of 81.9%, which outperforms the model trained on human data by +5.3% and that on Gaussian RL-generated data by +12.6%. The results highlight our diffusion RL as an effective alternative for generating abundant, high-quality, and low-variance demonstrations for VLA models.
Problem

Research questions and friction points this paper is trying to address.

Overcoming reliance on costly human demonstrations for VLA model training
Addressing poor performance of conventional RL on long-horizon manipulation tasks
Generating high-quality autonomous demonstrations to scale VLA training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modified diffusion policy optimization algorithm
Generates high-quality low-variance trajectories
Uses iterative denoising for smooth demonstrations
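The iterative denoising described above can be illustrated with a minimal DDPM-style sampling loop. This is a hedged sketch, not the paper's algorithm: the `denoiser` function, the linear noise schedule, the horizon, and the action dimension are all assumptions for illustration (the paper's learned diffusion policy would also condition on observations and language instructions).

```python
import numpy as np

def denoise_actions(denoiser, horizon=16, action_dim=7, n_steps=10, rng=None):
    """Sample an action trajectory by iterative denoising (DDPM-style sketch).

    `denoiser` is assumed to predict the noise component eps_hat given the
    noisy trajectory and the diffusion step index.
    """
    rng = np.random.default_rng(rng)
    # Linear noise schedule (an assumption; the paper's schedule is unspecified).
    betas = np.linspace(1e-4, 0.02, n_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal((horizon, action_dim))  # start from pure noise
    for t in reversed(range(n_steps)):
        eps_hat = denoiser(x, t)
        # Posterior mean of the reverse diffusion step.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:  # inject noise at all but the final step
            x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x

# Toy "denoiser" that always predicts zero noise, for illustration only.
traj = denoise_actions(lambda x, t: np.zeros_like(x), rng=0)
print(traj.shape)  # (16, 7)
```

Each denoising step shrinks the trajectory toward the model's prediction while re-injecting a small amount of noise, which is the mechanism the abstract credits for smooth, consistent demonstrations.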
Rushuai Yang
Hong Kong University of Science and Technology
Reinforcement Learning · Embodied AI
Hangxing Wei
Wuhan University, Wuhan, China
Ran Zhang
University of Chinese Academy of Sciences, Beijing, China
Zhiyuan Feng
Tsinghua University, Beijing, China
Xiaoyu Chen
Tsinghua University, Beijing, China
Tong Li
Northwestern Polytechnical University, Xi’an, China
Chuheng Zhang
Microsoft Research Asia, Beijing, China
Li Zhao
Microsoft Research Asia, Beijing, China
Jiang Bian
Microsoft Research Asia, Beijing, China
Xiu Su
Big Data Institute, Central South University, Changsha, China
Yi Chen
Hong Kong University of Science and Technology, Hong Kong, China