A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Real-world robotic reinforcement learning suffers from sparse rewards and inefficient exploration. This paper proposes VLAC, the first framework to unify a vision-language model into a joint actor-critic architecture, eliminating handcrafted reward engineering and enabling end-to-end learning by co-modeling dense process rewards and action generation. VLAC builds on the InternVL foundation model and is trained in a multi-task fashion on large-scale vision-language data, robot trajectory datasets, and human demonstrations. Key innovations include a prompt-driven reward-action co-generation mechanism, a graded human-robot collaborative exploration protocol, and robustness enhancements via negative-sample mining and semantic-misalignment augmentation. Evaluated on four real-world manipulation tasks, VLAC raises the success rate from about 30% to about 90% within 200 episodes; with human-in-the-loop collaboration, sample efficiency improves by 50%, culminating in 100% success. Moreover, VLAC supports one-shot in-context transfer to unseen tasks and environments.
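A minimal sketch of the reward-action co-generation idea follows, assuming a single multimodal model that is switched between critic and policy roles purely by its prompt. The wrapper functions, `generate` signature, prompts, and output fields are hypothetical illustrations, not the paper's API.

```python
# Minimal sketch of prompt-driven reward-action co-generation, assuming one
# multimodal model switched between critic and policy roles by its prompt.
# `model.generate`, the prompts, and the output fields are all hypothetical.
from dataclasses import dataclass

@dataclass
class RewardSignal:
    progress_delta: float  # estimated change in task progress between frames
    done: bool             # whether the language goal appears achieved

def critic_step(model, obs_prev, obs_curr, goal: str) -> RewardSignal:
    """Critic role: compare paired observations against a language goal."""
    out = model.generate(
        images=[obs_prev, obs_curr],
        prompt=f"Goal: {goal}. How much progress was made, and is the task done?",
    )
    return RewardSignal(progress_delta=out["progress_delta"], done=out["done"])

def policy_step(model, obs_curr, goal: str):
    """Policy role: same weights, different prompt, emits action tokens."""
    return model.generate(
        images=[obs_curr],
        prompt=f"Goal: {goal}. Generate the next robot action.",
    )
```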

📝 Abstract
Robotic real-world reinforcement learning (RL) with vision-language-action (VLA) models is bottlenecked by sparse, handcrafted rewards and inefficient exploration. We introduce VLAC, a general process reward model built upon InternVL and trained on large-scale heterogeneous datasets. Given pairwise observations and a language goal, it outputs a dense progress delta and a done signal, eliminating task-specific reward engineering, and it supports one-shot in-context transfer to unseen tasks and environments. VLAC is trained on vision-language datasets to strengthen perception, dialogue, and reasoning capabilities, together with robot and human trajectory data that grounds action generation and progress estimation, and is further strengthened to reject irrelevant prompts and to detect regression or stagnation by constructing large numbers of negative and semantically mismatched samples. Under prompt control, a single VLAC model alternately generates reward and action tokens, unifying critic and policy. Deployed inside an asynchronous real-world RL loop, VLAC is layered with a graded human-in-the-loop protocol (offline demonstration replay, return-and-explore, human-guided explore) that accelerates exploration and stabilizes early learning. Across four distinct real-world manipulation tasks, VLAC lifts success rates from about 30% to about 90% within 200 real-world interaction episodes; incorporating human-in-the-loop interventions yields a further 50% improvement in sample efficiency and achieves up to 100% final success.
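To make the loop concrete, here is a hedged sketch of how the dense process reward could drive an RL episode, reusing the hypothetical `critic_step` and `policy_step` helpers from the earlier sketch. The environment and replay-buffer interfaces are invented stand-ins, and the real system runs asynchronously rather than in this simplified synchronous form.

```python
# Hedged sketch of an RL episode driven by the critic's dense progress
# deltas in place of a handcrafted reward. `env` and `buffer` follow an
# invented interface; `critic_step`/`policy_step` are the earlier sketches.
def run_episode(env, model, buffer, goal: str, max_steps: int = 200):
    obs = env.reset()
    for _ in range(max_steps):
        action = policy_step(model, obs, goal)            # policy role
        next_obs = env.step(action)
        signal = critic_step(model, obs, next_obs, goal)  # critic role
        # Dense process reward: progress delta, plus a bonus on completion.
        reward = signal.progress_delta + (1.0 if signal.done else 0.0)
        buffer.add(obs, action, reward, next_obs, signal.done)
        if signal.done:
            break
        obs = next_obs
```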
Problem

Research questions and friction points this paper is trying to address.

Overcoming sparse rewards in robotic reinforcement learning
Eliminating task-specific reward engineering requirements
Enabling efficient exploration in unseen environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLAC model unifies critic and policy via prompt control
Trained on heterogeneous datasets for perception and reasoning
Graded human-in-the-loop protocol accelerates real-world learning (see the sketch below)
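The abstract names three graded intervention levels: offline demonstration replay, return-and-explore, and human-guided explore. Below is an illustrative sketch of how such an escalation policy might be wired; the thresholds and the escalation order are assumptions for illustration, not the paper's specification.

```python
# Illustrative sketch of a graded human-in-the-loop protocol. The three
# levels come from the abstract; the success-rate thresholds and the rule
# that lower success triggers more human involvement are assumptions.
from enum import Enum, auto

class Intervention(Enum):
    DEMO_REPLAY = auto()         # replay offline demonstrations into the buffer
    RETURN_AND_EXPLORE = auto()  # return near a known good state, then explore
    HUMAN_GUIDED = auto()        # a human operator steers exploration directly

def select_intervention(recent_success_rate: float) -> Intervention:
    """Escalate human involvement as autonomous progress stalls."""
    if recent_success_rate < 0.1:
        return Intervention.HUMAN_GUIDED
    if recent_success_rate < 0.4:
        return Intervention.RETURN_AND_EXPLORE
    return Intervention.DEMO_REPLAY
```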
Authors
Shaopeng Zhai
Shanghai AI Lab
Qi Zhang
Shanghai AI Lab
Tianyi Zhang
Shanghai AI Lab
Fuxian Huang
Shanghai AI Lab
Haoran Zhang
Shanghai AI Lab
Ming Zhou
Shanghai AI Lab
Shengzhe Zhang
Shanghai AI Lab
Litao Liu
Shanghai AI Lab
Sixu Lin
Shanghai AI Lab
Jiangmiao Pang
Shanghai AI Lab

Topics: Robot Learning · Reinforcement Learning