🤖 AI Summary
Current Vision-Language-Action (VLA) models rely heavily on costly, human-teleoperated demonstration data for pretraining—limiting scalability. Standard reinforcement learning (RL) trajectories suffer from insufficient behavioral diversity, hindering large-scale VLA training. To address this, we propose DLR, an information-theoretic framework for diverse policy discovery. DLR jointly optimizes entropy-regularized policy learning and behavior clustering to autonomously discover multiple high-success-rate, semantically separable behavioral modes within a single task, substantially improving state-action space coverage. Evaluated on the LIBERO benchmark, DLR increases trajectory diversity by 42% over baseline RL; VLA models pretrained on DLR-generated data achieve a 19.3% average performance gain on unseen tasks compared to conventional RL-based baselines, while enabling efficient data scaling. Our key contribution is the first integration of mutual information maximization with structured behavioral clustering to enable automated, large-scale generation of high-quality, highly diverse manipulation trajectories.
📝 Abstract
Scaling vision-language-action (VLA) model pre-training requires large volumes of diverse, high-quality manipulation trajectories. Most current data is obtained via human teleoperation, which is expensive and difficult to scale. Reinforcement learning (RL) methods learn useful skills through autonomous exploration, making them a viable approach for generating data. However, standard RL training collapses to a narrow execution pattern, limiting its utility for large-scale pre-training. We propose Discover, Lea rn and Reinforce (DLR), an information-theoretic pattern discovery framework that generates multiple distinct, high-success behavioral patterns for VLA pretraining. Empirically, DLR generates a markedly more diverse trajectory corpus on LIBERO. Specifically, it learns multiple distinct, high-success strategies for the same task where standard RL discovers only one, and hence it covers substantially broader regions of the state-action space. When adapted to unseen downstream task suites, VLA models pretrained on our diverse RL data surpass counterparts trained on equal-sized standard RL datasets. Moreover, DLR exhibits positive data-scaling behavior that single-pattern RL lacks. These results position multi-pattern RL as a practical, scalable data engine for embodied foundation models.