Discover, Learn, and Reinforce: Scaling Vision-Language-Action Pretraining with Diverse RL-Generated Trajectories

📅 2025-11-24

📈 Citations: 0

✨ Influential: 0

career value

199K/year

🤖 AI Summary

Current Vision-Language-Action (VLA) models rely heavily on costly, human-teleoperated demonstration data for pretraining—limiting scalability. Standard reinforcement learning (RL) trajectories suffer from insufficient behavioral diversity, hindering large-scale VLA training. To address this, we propose DLR, an information-theoretic framework for diverse policy discovery. DLR jointly optimizes entropy-regularized policy learning and behavior clustering to autonomously discover multiple high-success-rate, semantically separable behavioral modes within a single task, substantially improving state-action space coverage. Evaluated on the LIBERO benchmark, DLR increases trajectory diversity by 42% over baseline RL; VLA models pretrained on DLR-generated data achieve a 19.3% average performance gain on unseen tasks compared to conventional RL-based baselines, while enabling efficient data scaling. Our key contribution is the first integration of mutual information maximization with structured behavioral clustering to enable automated, large-scale generation of high-quality, highly diverse manipulation trajectories.

Technology Category

Application Category

📝 Abstract

Scaling vision-language-action (VLA) model pre-training requires large volumes of diverse, high-quality manipulation trajectories. Most current data is obtained via human teleoperation, which is expensive and difficult to scale. Reinforcement learning (RL) methods learn useful skills through autonomous exploration, making them a viable approach for generating data. However, standard RL training collapses to a narrow execution pattern, limiting its utility for large-scale pre-training. We propose Discover, Lea rn and Reinforce (DLR), an information-theoretic pattern discovery framework that generates multiple distinct, high-success behavioral patterns for VLA pretraining. Empirically, DLR generates a markedly more diverse trajectory corpus on LIBERO. Specifically, it learns multiple distinct, high-success strategies for the same task where standard RL discovers only one, and hence it covers substantially broader regions of the state-action space. When adapted to unseen downstream task suites, VLA models pretrained on our diverse RL data surpass counterparts trained on equal-sized standard RL datasets. Moreover, DLR exhibits positive data-scaling behavior that single-pattern RL lacks. These results position multi-pattern RL as a practical, scalable data engine for embodied foundation models.

Problem

Research questions and friction points this paper is trying to address.

Scaling vision-language-action pretraining requires diverse manipulation trajectories

Current human teleoperation data is expensive and difficult to scale

Standard reinforcement learning collapses to narrow execution patterns limiting utility

Innovation

Methods, ideas, or system contributions that make the work stand out.

Information-theoretic pattern discovery framework generates diverse trajectories

Learns multiple distinct high-success strategies for same tasks

Enables scalable data generation for vision-language-action pretraining

🔎 Similar Papers

No similar papers found.