SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the vulnerability of vision-language models used as reward evaluators in reinforcement learning to partial observability and distribution shift, which often lets policies exploit perceptual artifacts rather than make genuine task progress. To overcome this, the authors propose SOLE-R1, a video-language reasoning model designed for online reinforcement learning that operates solely on raw video inputs and natural-language task descriptions. SOLE-R1 applies spatiotemporal chain-of-thought reasoning to produce dense per-timestep estimates of task progress, which are used directly as reward signals. This enables, for the first time, zero-shot online reinforcement learning without ground-truth rewards, success labels, demonstrations, or task-specific fine-tuning, while substantially improving robustness against reward hacking. Evaluated across four simulated environments and a real robot, SOLE-R1 drives successful learning on 24 previously unseen manipulation tasks, substantially outperforming strong baselines such as GPT-5 and Gemini-3-Pro.
📝 Abstract
Vision-language models (VLMs) have shown impressive capabilities across diverse tasks, motivating efforts to leverage these models to supervise robot learning. However, when used as evaluators in reinforcement learning (RL), today's strongest models often fail under partial observability and distribution shift, enabling policies to exploit perceptual errors rather than solve the task. To address this limitation, we introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL. Given only raw video observations and a natural-language goal, SOLE-R1 performs per-timestep spatiotemporal chain-of-thought (CoT) reasoning and produces dense estimates of task progress that can be used directly as rewards. To train SOLE-R1, we develop a large-scale video trajectory and reasoning synthesis pipeline that generates temporally grounded CoT traces aligned with continuous progress supervision. This data is combined with foundational spatial and multi-frame temporal reasoning, and used to train the model with a hybrid framework that couples supervised fine-tuning with RL from verifiable rewards. Across four different simulation environments and a real-robot setting, SOLE-R1 enables zero-shot online RL from random initialization: robots learn previously unseen manipulation tasks without ground-truth rewards, success indicators, demonstrations, or task-specific tuning. SOLE-R1 succeeds on 24 unseen tasks and substantially outperforms strong vision-language rewarders, including GPT-5 and Gemini-3-Pro, while exhibiting markedly greater robustness to reward hacking.
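The abstract's core mechanism is that dense per-timestep progress estimates from the reasoning model are consumed directly as rewards. A minimal sketch of one plausible way to do this is the first-difference (potential-based) form below; the function name and the differencing scheme are illustrative assumptions, not the paper's confirmed formulation:

```python
from typing import List

def progress_to_rewards(progress: List[float]) -> List[float]:
    """Convert per-timestep task-progress estimates in [0, 1] into dense rewards.

    Hypothetical scheme: reward each step by the change in estimated progress,
    so the policy is rewarded for advancing the task and penalized for regressing.
    The episode's rewards then telescope to the final progress estimate.
    """
    rewards = []
    prev = 0.0  # assume zero progress before the first observation
    for p in progress:
        rewards.append(p - prev)
        prev = p
    return rewards

# Example trajectory: progress climbs, dips (a regression), then recovers.
progress = [0.1, 0.3, 0.25, 0.6, 1.0]
rewards = progress_to_rewards(progress)
```

One appeal of this differencing form is that the return of an episode equals its final progress estimate, so a video-language rewarder that scores progress accurately also scores episodes accurately.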
Problem

Research questions and friction points this paper is trying to address.

vision-language models
reinforcement learning
reward hacking
partial observability
distribution shift
Innovation

Methods, ideas, or system contributions that make the work stand out.

video-language reasoning
spatiotemporal chain-of-thought
reward-free reinforcement learning
online robot learning
reward hacking robustness