Discover, Learn, and Reinforce: Scaling Vision-Language-Action Pretraining with Diverse RL-Generated Trajectories

📅 2025-11-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current Vision-Language-Action (VLA) models rely heavily on costly, human-teleoperated demonstration data for pretraining—limiting scalability. Standard reinforcement learning (RL) trajectories suffer from insufficient behavioral diversity, hindering large-scale VLA training. To address this, we propose DLR, an information-theoretic framework for diverse policy discovery. DLR jointly optimizes entropy-regularized policy learning and behavior clustering to autonomously discover multiple high-success-rate, semantically separable behavioral modes within a single task, substantially improving state-action space coverage. Evaluated on the LIBERO benchmark, DLR increases trajectory diversity by 42% over baseline RL; VLA models pretrained on DLR-generated data achieve a 19.3% average performance gain on unseen tasks compared to conventional RL-based baselines, while enabling efficient data scaling. Our key contribution is the first integration of mutual information maximization with structured behavioral clustering to enable automated, large-scale generation of high-quality, highly diverse manipulation trajectories.

Technology Category

Application Category

📝 Abstract
Scaling vision-language-action (VLA) model pre-training requires large volumes of diverse, high-quality manipulation trajectories. Most current data is obtained via human teleoperation, which is expensive and difficult to scale. Reinforcement learning (RL) methods learn useful skills through autonomous exploration, making them a viable approach for generating data. However, standard RL training collapses to a narrow execution pattern, limiting its utility for large-scale pre-training. We propose Discover, Lea rn and Reinforce (DLR), an information-theoretic pattern discovery framework that generates multiple distinct, high-success behavioral patterns for VLA pretraining. Empirically, DLR generates a markedly more diverse trajectory corpus on LIBERO. Specifically, it learns multiple distinct, high-success strategies for the same task where standard RL discovers only one, and hence it covers substantially broader regions of the state-action space. When adapted to unseen downstream task suites, VLA models pretrained on our diverse RL data surpass counterparts trained on equal-sized standard RL datasets. Moreover, DLR exhibits positive data-scaling behavior that single-pattern RL lacks. These results position multi-pattern RL as a practical, scalable data engine for embodied foundation models.
Problem

Research questions and friction points this paper is trying to address.

Scaling vision-language-action pretraining requires diverse manipulation trajectories
Current human teleoperation data is expensive and difficult to scale
Standard reinforcement learning collapses to narrow execution patterns limiting utility
Innovation

Methods, ideas, or system contributions that make the work stand out.

Information-theoretic pattern discovery framework generates diverse trajectories
Learns multiple distinct high-success strategies for same tasks
Enables scalable data generation for vision-language-action pretraining
🔎 Similar Papers
No similar papers found.
Rushuai Yang
Rushuai Yang
Hong Kong University of Science and Technology
Reinforcement LearningEmbodied AI
Z
Zhiyuan Feng
Tsinghua University
T
Tianxiang Zhang
Wuhan University
K
Kaixin Wang
Microsoft Research
C
Chuheng Zhang
Microsoft Research
L
Li Zhao
Microsoft Research
X
Xiu Su
Central South University
Y
Yi Chen
The Hong Kong University of Science and Technology
J
Jiang Bian
Microsoft Research