Self-Improving Vision-Language-Action Models with Data Generation via Residual RL

📅 2025-10-30
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
To address the limitations of supervised fine-tuning (SFT), namely its reliance on costly human demonstrations and its poor generalization, this paper proposes PLD, a self-evolving framework for vision-language-action (VLA) models that requires no additional human annotation. Methodologically, PLD introduces: (1) a lightweight residual reinforcement learning agent that actively probes failure modes of a generic VLA model and generates corrective recovery trajectories; (2) a deployment-distribution-aware hybrid rollout strategy that jointly optimizes failure-region discovery and policy-distribution alignment; and (3) a plug-and-play, three-stage self-improvement pipeline: probe, collect, distill. Evaluated on the LIBERO benchmark, PLD reaches a 99% task success rate; it improves performance by over 50% on SimplerEnv; and it attains 100% success on real-world Franka and YAM robotic arms, demonstrating substantial gains in generalization and scalability.
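The core residual-RL idea in the summary can be illustrated with a minimal sketch: the frozen VLA generalist proposes an action, and a small learned residual actor adds a bounded correction on top of it. All function names, the toy observation layout, and the `scale` bound below are illustrative assumptions, not details from the paper.

```python
import math

def base_policy(obs):
    # Stand-in for the frozen VLA generalist: maps an observation
    # (here a flat list of floats) to a toy 3-DoF action.
    return [math.tanh(x) for x in obs[:3]]

def residual_policy(obs, base_action, scale=0.1):
    # Stand-in for the lightweight residual RL actor: a small,
    # bounded correction conditioned on the observation and the
    # generalist's proposed action.
    return [scale * math.tanh(o + a) for o, a in zip(obs[3:6], base_action)]

def combined_action(obs):
    # Residual RL composition: executed action = generalist action
    # plus the learned correction.
    a_base = base_policy(obs)
    a_res = residual_policy(obs, a_base)
    return [b + r for b, r in zip(a_base, a_res)]
```

Because the correction is small and bounded, the combined policy stays close to the generalist's behavior while the residual actor explores around its failure modes.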

📝 Abstract
Supervised fine-tuning (SFT) has become the de facto post-training strategy for large vision-language-action (VLA) models, but its reliance on costly human demonstrations limits scalability and generalization. We propose Probe, Learn, Distill (PLD), a three-stage plug-and-play framework that improves VLAs through residual reinforcement learning (RL) and distribution-aware data collection. In Stage 1, we train lightweight residual actors to probe failure regions of the VLA generalist. In Stage 2, we use a hybrid rollout scheme that aligns collected trajectories with the generalist's deployment distribution while capturing recovery behaviors. In Stage 3, we distill the curated trajectories back into the generalist with standard SFT. PLD achieves near-saturated 99% task success on LIBERO, over 50% gains in SimplerEnv, and 100% success on real-world Franka and YAM arm manipulation tasks. Ablations show that residual probing and distribution-aware replay are key to collecting deployment-aligned data that improves both seen and unseen tasks, offering a scalable path toward self-improving VLA models.
Problem

Research questions and friction points this paper is trying to address.

Reducing reliance on costly human demonstrations for vision-language-action models
Improving generalization through residual RL and distribution-aware data collection
Enabling scalable self-improvement for robotic manipulation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Residual RL probes VLA failure regions
Hybrid rollout captures recovery behaviors
Distillation integrates curated trajectories via SFT
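The three contributions above fit together as one loop. A minimal sketch of that loop follows, assuming a Gym-style task interface; the function names, the mixing probability `p_residual`, and the success-filtering rule are illustrative assumptions, not the paper's actual implementation.

```python
import random

def hybrid_rollout(task, base_policy, residual_policy, p_residual=0.5):
    # Stage 2 sketch: mix base-only steps with base+residual steps so the
    # collected trajectory stays close to the generalist's deployment
    # distribution while still capturing the residual actor's recoveries.
    trajectory, obs = [], task.reset()
    done, success = False, False
    while not done:
        action = base_policy(obs)
        if random.random() < p_residual:
            action = action + residual_policy(obs, action)
        next_obs, done, success = task.step(action)
        trajectory.append((obs, action))
        obs = next_obs
    return trajectory, success

def probe_learn_distill(tasks, generalist, train_residual, sft_update):
    # PLD outer-loop sketch: probe, collect, distill.
    dataset = []
    for task in tasks:
        # Stage 1: train a residual actor that probes this task's failures.
        residual = train_residual(task, generalist)
        # Stage 2: collect deployment-aligned successful rollouts.
        traj, success = hybrid_rollout(task, generalist, residual)
        if success:
            dataset.extend(traj)
    # Stage 3: distill curated trajectories back via standard SFT.
    return sft_update(generalist, dataset)
```

Keeping distillation as plain SFT on the curated data is what makes the pipeline plug-and-play: the generalist's training code is unchanged, and only the data source improves.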