A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Real-world robotic reinforcement learning suffers from sparse rewards and inefficient exploration. This paper proposes VLAC, the first framework to unify a vision-language model into a joint actor-critic architecture, eliminating handcrafted reward engineering and enabling end-to-end learning by co-modeling dense process rewards and action generation. VLAC builds on the InternVL foundation model and is trained in a multi-task fashion on large-scale vision-language data, robot trajectory datasets, and human demonstrations. Key innovations include a prompt-driven reward-action co-generation mechanism, a graded human-robot collaborative exploration protocol, and robustness enhancements via negative-sample mining and semantic-misalignment augmentation. Evaluated on four real-world manipulation tasks, VLAC raises the success rate from about 30% to about 90% within 200 episodes; with human-in-the-loop collaboration, sample efficiency improves by 50%, culminating in 100% success. Moreover, VLAC supports one-shot in-context transfer to unseen tasks and environments.
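A minimal sketch of the reward-action co-generation idea follows, assuming a single multimodal model that is switched between critic and policy roles purely by its prompt. The wrapper functions, `generate` signature, prompts, and output fields are hypothetical illustrations, not the paper's API.

```python
# Minimal sketch of prompt-driven reward-action co-generation, assuming one
# multimodal model switched between critic and policy roles by its prompt.
# `model.generate`, the prompts, and the output fields are all hypothetical.
from dataclasses import dataclass

@dataclass
class RewardSignal:
    progress_delta: float  # estimated change in task progress between frames
    done: bool             # whether the language goal appears achieved

def critic_step(model, obs_prev, obs_curr, goal: str) -> RewardSignal:
    """Critic role: compare paired observations against a language goal."""
    out = model.generate(
        images=[obs_prev, obs_curr],
        prompt=f"Goal: {goal}. How much progress was made, and is the task done?",
    )
    return RewardSignal(progress_delta=out["progress_delta"], done=out["done"])

def policy_step(model, obs_curr, goal: str):
    """Policy role: same weights, different prompt, emits action tokens."""
    return model.generate(
        images=[obs_curr],
        prompt=f"Goal: {goal}. Generate the next robot action.",
    )
```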

📝 Abstract
Robotic real-world reinforcement learning (RL) with vision-language-action (VLA) models is bottlenecked by sparse, handcrafted rewards and inefficient exploration. We introduce VLAC, a general process reward model built upon InternVL and trained on large-scale heterogeneous datasets. Given pairwise observations and a language goal, it outputs a dense progress delta and a done signal, eliminating task-specific reward engineering, and it supports one-shot in-context transfer to unseen tasks and environments. VLAC is trained on vision-language datasets to strengthen perception, dialogue, and reasoning capabilities, together with robot and human trajectory data that grounds action generation and progress estimation, and is further strengthened to reject irrelevant prompts and to detect regression or stagnation by constructing large numbers of negative and semantically mismatched samples. Under prompt control, a single VLAC model alternately generates reward and action tokens, unifying critic and policy. Deployed inside an asynchronous real-world RL loop, VLAC is layered with a graded human-in-the-loop protocol (offline demonstration replay, return-and-explore, human-guided explore) that accelerates exploration and stabilizes early learning. Across four distinct real-world manipulation tasks, VLAC lifts success rates from about 30% to about 90% within 200 real-world interaction episodes; incorporating human-in-the-loop interventions yields a further 50% improvement in sample efficiency and achieves up to 100% final success.
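To make the loop concrete, here is a hedged sketch of how the dense process reward could drive an RL episode, reusing the hypothetical `critic_step` and `policy_step` helpers from the earlier sketch. The environment and replay-buffer interfaces are invented stand-ins, and the real system runs asynchronously rather than in this simplified synchronous form.

```python
# Hedged sketch of an RL episode driven by the critic's dense progress
# deltas in place of a handcrafted reward. `env` and `buffer` follow an
# invented interface; `critic_step`/`policy_step` are the earlier sketches.
def run_episode(env, model, buffer, goal: str, max_steps: int = 200):
    obs = env.reset()
    for _ in range(max_steps):
        action = policy_step(model, obs, goal)            # policy role
        next_obs = env.step(action)
        signal = critic_step(model, obs, next_obs, goal)  # critic role
        # Dense process reward: progress delta, plus a bonus on completion.
        reward = signal.progress_delta + (1.0 if signal.done else 0.0)
        buffer.add(obs, action, reward, next_obs, signal.done)
        if signal.done:
            break
        obs = next_obs
```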
Problem

Research questions and friction points this paper is trying to address.

Overcoming sparse rewards in robotic reinforcement learning
Eliminating task-specific reward engineering requirements
Enabling efficient exploration in unseen environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLAC model unifies critic and policy via prompt control
Trained on heterogeneous datasets for perception and reasoning
Graded human-in-the-loop protocol accelerates real-world learning (see the sketch below)
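The abstract names three graded intervention levels: offline demonstration replay, return-and-explore, and human-guided explore. Below is an illustrative sketch of how such an escalation policy might be wired; the thresholds and the escalation order are assumptions for illustration, not the paper's specification.

```python
# Illustrative sketch of a graded human-in-the-loop protocol. The three
# levels come from the abstract; the success-rate thresholds and the rule
# that lower success triggers more human involvement are assumptions.
from enum import Enum, auto

class Intervention(Enum):
    DEMO_REPLAY = auto()         # replay offline demonstrations into the buffer
    RETURN_AND_EXPLORE = auto()  # return near a known good state, then explore
    HUMAN_GUIDED = auto()        # a human operator steers exploration directly

def select_intervention(recent_success_rate: float) -> Intervention:
    """Escalate human involvement as autonomous progress stalls."""
    if recent_success_rate < 0.1:
        return Intervention.HUMAN_GUIDED
    if recent_success_rate < 0.4:
        return Intervention.RETURN_AND_EXPLORE
    return Intervention.DEMO_REPLAY
```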
Authors
Shaopeng Zhai
Shanghai AI Lab
Qi Zhang
Shanghai AI Lab
Tianyi Zhang
Shanghai AI Lab
Fuxian Huang
Shanghai AI Lab
Haoran Zhang
Shanghai AI Lab
Ming Zhou
Shanghai AI Lab
Shengzhe Zhang
Shanghai AI Lab
Litao Liu
Shanghai AI Lab
Sixu Lin
Shanghai AI Lab
Jiangmiao Pang
Shanghai AI Lab

Topics: Robot Learning · Reinforcement Learning