Refined Policy Distillation: From VLA Generalists to RL Experts

📅 2025-03-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Generalist vision-language-action (VLA) models exhibit low success rates, fragile generalization, and heavy reliance on task-specific fine-tuning when deployed on real-world robotic tasks. Method: This paper proposes Refined Policy Distillation (RPD), a framework for online action-guided distillation from a large VLA teacher model into a lightweight RL-based student policy. RPD integrates imitation learning, policy distillation, and proximal policy optimization (PPO). The authors instantiate RPD with Octo and OpenVLA as teachers and train student policies in the ManiSkill2 simulation platform. Results: The distilled compact student policies significantly outperform their VLA teachers across diverse manipulation tasks, converging faster under both dense and sparse reward settings. They are also robust to changes in camera perspective and generalize zero-shot to task variants the underlying VLA cannot solve, without additional fine-tuning.

๐Ÿ“ Abstract
Recent generalist Vision-Language-Action Models (VLAs) can perform a variety of tasks on real robots with remarkable generalization capabilities. However, reported success rates are often not on par with those of expert policies. Moreover, VLAs usually do not work out of the box and often must be fine-tuned as they are sensitive to setup changes. In this work, we present Refined Policy Distillation (RPD), an RL-based policy refinement method that enables the distillation of large generalist models into small, high-performing expert policies. The student policy is guided during the RL exploration by actions of a teacher VLA for increased sample efficiency and faster convergence. Different from previous work that focuses on applying VLAs to real-world experiments, we create fine-tuned versions of Octo and OpenVLA for ManiSkill2 to evaluate RPD in simulation. As our results for different manipulation tasks demonstrate, RPD enables the RL agent to learn expert policies that surpass the teacher's performance in both dense and sparse reward settings. Our approach is even robust to changes in the camera perspective and can generalize to task variations that the underlying VLA cannot solve.
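The teacher-guided refinement described in the abstract can be pictured as a PPO objective augmented with a distillation term that pulls the student's actions toward the teacher VLA's. The following is a minimal sketch, not the paper's exact formulation: the L2 action penalty and its weight `beta` are assumptions chosen for illustration.

```python
import numpy as np

def rpd_loss(ratio, advantage, student_action, teacher_action,
             clip_eps=0.2, beta=0.1):
    """Hypothetical RPD-style objective: PPO clipped surrogate plus a
    distillation penalty toward the teacher VLA's action.

    ratio:          pi_student(a|s) / pi_old(a|s), per sample
    advantage:      estimated advantages, per sample
    student_action: actions proposed by the student policy
    teacher_action: actions proposed by the teacher VLA
    beta:           assumed distillation weight (not from the paper)
    """
    # Standard PPO clipped surrogate; negated because we minimize a loss.
    surrogate = np.minimum(
        ratio * advantage,
        np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage,
    )
    ppo_loss = -np.mean(surrogate)
    # Distillation term: mean squared distance to the teacher's action.
    distill_loss = np.mean((student_action - teacher_action) ** 2)
    return ppo_loss + beta * distill_loss
```

With `beta = 0` this reduces to plain PPO; a larger `beta` makes the student imitate the teacher more closely early in training, which matches the paper's stated goal of using teacher actions to improve sample efficiency during RL exploration.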
Problem

Research questions and friction points this paper is trying to address.

Distill generalist VLAs into expert RL policies
Improve sample efficiency and convergence in RL
Enhance robustness to setup and task variations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Refined Policy Distillation for expert policies
RL-based refinement with teacher-student guidance
Robust to camera changes and task variations