AI Summary
This work proposes On-Policy VLA Distillation (VLA-OPD), a framework addressing key challenges in post-deployment training of vision-language-action (VLA) models: distributional shift, catastrophic forgetting, the low sample efficiency of reinforcement learning (RL), and sparse rewards. VLA-OPD combines the efficiency of supervised fine-tuning (SFT) with the robustness of RL by leveraging an expert teacher to provide dense, token-level supervision over the agent's self-generated trajectories. Operating on policy-induced states, the method enables active error correction while preserving pre-trained capabilities. A novel Reverse-KL objective is introduced for policy distillation, mitigating both entropy collapse and entropy explosion, filtering out the teacher's epistemic uncertainty, and maintaining action diversity. Experiments demonstrate that VLA-OPD significantly outperforms pure RL and pure SFT baselines on the LIBERO and RoboTwin2.0 benchmarks, achieving a strong balance between sample efficiency and robustness while substantially alleviating catastrophic forgetting.
Abstract
Although pre-trained Vision-Language-Action (VLA) models exhibit impressive generalization in robotic manipulation, post-training remains crucial to ensure reliable performance during deployment. However, standard offline Supervised Fine-Tuning (SFT) suffers from distribution shift and catastrophic forgetting of pre-trained capabilities, while online Reinforcement Learning (RL) struggles with sparse rewards and poor sample efficiency. In this paper, we propose On-Policy VLA Distillation (VLA-OPD), a framework that bridges the efficiency of SFT and the robustness of RL. Instead of relying on sparse environmental rewards, VLA-OPD leverages an expert teacher to provide dense, token-level supervision on the student's self-generated trajectories. This enables active error correction on policy-induced states while preserving pre-trained general capabilities through gentle alignment. Crucially, we formulate VLA-OPD via a Reverse-KL objective. Unlike standard Forward-KL, which induces mode-covering entropy explosion, or Hard-CE, which causes premature entropy collapse, our bounded mode-seeking objective ensures stable policy learning by filtering out the teacher's epistemic uncertainty while maintaining action diversity. Experiments on the LIBERO and RoboTwin2.0 benchmarks demonstrate that VLA-OPD significantly improves sample efficiency over RL and robustness over SFT, while effectively mitigating catastrophic forgetting during post-training.
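The contrast between the mode-seeking Reverse-KL and the mode-covering Forward-KL can be made concrete with a small numerical sketch. This is an illustrative implementation over per-token action logits, not the paper's actual code; function names and tensor shapes (batch, tokens, action vocabulary) are assumptions.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the action-vocabulary axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def reverse_kl_distill_loss(student_logits, teacher_logits):
    """Token-level Reverse-KL: KL(student || teacher), averaged over tokens.

    Mode-seeking: the student is penalized for putting mass where the
    teacher assigns low probability, so it concentrates on the teacher's
    dominant action modes instead of covering its full (possibly
    uncertain) distribution.
    """
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return (p_s * (np.log(p_s) - np.log(p_t))).sum(-1).mean()

def forward_kl_distill_loss(student_logits, teacher_logits):
    """Forward-KL baseline: KL(teacher || student), the mode-covering
    objective that the paper argues can drive entropy explosion."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return (p_t * (np.log(p_t) - np.log(p_s))).sum(-1).mean()
```

Both losses are zero when student and teacher distributions match and positive otherwise; they differ in which direction of mismatch they punish, which is the source of the collapse-vs-explosion behavior discussed above.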