🤖 AI Summary
This work addresses the challenge of solving vision-based reinforcement learning tasks from only a single natural language instruction. We propose a progressive goal discovery framework that generates dense rewards end-to-end, without handcrafted reward functions, dense human feedback, or non-visual state representations. It leverages a vision-language model (VLM) to identify and rank intermediate states that bring the agent closer to the goal, and trains the agent to minimise its distance to the top-ranked state in an embedding space learned from unlabelled visual data. To mitigate VLM hallucination and calibration errors, we introduce an ELO-based rating mechanism that denoises VLM feedback by aggregating many pairwise comparisons rather than trusting any single output. Evaluated on classic control and robotic manipulation benchmarks, our method achieves a mean final success rate of ~95%, substantially outperforming the best prior baseline (~45%) while significantly reducing dependence on human supervision.
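To make the ELO-based denoising step concrete, below is a minimal sketch of an Elo-style rating update driven by pairwise VLM comparisons. The K-factor, initial rating, and function names are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch: Elo-style rating update for candidate goal states.
# Each noisy VLM pairwise comparison nudges ratings rather than
# overwriting them, so repeated queries average out VLM errors.

K = 32.0              # assumed K-factor controlling update magnitude
INIT_RATING = 1000.0  # assumed initial rating for a new candidate state

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that state A beats state B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, a_wins: bool) -> tuple[float, float]:
    """Update both ratings after one (possibly noisy) VLM comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_wins else 0.0
    r_a_new = r_a + K * (s_a - e_a)
    r_b_new = r_b + K * ((1.0 - s_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Example: the VLM judges state_1 as closer to task completion.
ratings = {"state_1": INIT_RATING, "state_2": INIT_RATING}
ratings["state_1"], ratings["state_2"] = elo_update(
    ratings["state_1"], ratings["state_2"], a_wins=True
)
```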
📝 Abstract
Natural language can offer a concise and human-interpretable means of specifying reinforcement learning (RL) tasks. The ability to extract rewards from a language instruction can enable the development of robotic systems that learn from human guidance; however, it remains a challenging problem, especially in visual environments. Existing approaches that employ large, pretrained language models either rely on non-visual environment representations, require prohibitively large amounts of feedback, or generate noisy, ill-shaped reward functions. In this paper, we propose a novel method, $\textbf{GoalLadder}$, that leverages vision-language models (VLMs) to train RL agents from a single language instruction in visual environments. GoalLadder works by incrementally discovering states that bring the agent closer to completing a task specified in natural language. To do so, it queries a VLM to identify states that represent an improvement in the agent's task progress and to rank them using pairwise comparisons. Unlike prior work, GoalLadder does not trust the VLM's feedback completely; instead, it ranks potential goal states using an ELO-based rating system, thus reducing the detrimental effects of noisy VLM feedback. Over the course of training, the agent is tasked with minimising the distance to the top-ranked goal in a learned embedding space, which is trained on unlabelled visual data. This key feature allows us to bypass the need for the abundant and accurate feedback typically required to train a well-shaped reward function. We demonstrate that GoalLadder outperforms existing related methods on classic control and robotic manipulation environments, achieving an average final success rate of $\sim$95% compared to only $\sim$45% for the best competitor.
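For a concrete picture of the distance-based dense reward, here is a minimal sketch, assuming a placeholder visual encoder in place of the paper's learned embedding network; the identity encoder and all names below are illustrative assumptions.

```python
import numpy as np

def encode(observation: np.ndarray) -> np.ndarray:
    """Stand-in for the visual encoder trained on unlabelled data.

    A flattened identity map is used purely for illustration; the
    paper's actual embedding is produced by a learned network.
    """
    return observation.reshape(-1).astype(np.float32)

def dense_reward(observation: np.ndarray, goal_embedding: np.ndarray) -> float:
    """Dense reward: negative embedding distance to the top-ranked goal."""
    z = encode(observation)
    return -float(np.linalg.norm(z - goal_embedding))

# Usage: the reward grows toward zero as the agent's observation
# approaches the current top-ranked goal state in embedding space.
goal = encode(np.zeros((8, 8)))                       # hypothetical goal image
obs = np.random.default_rng(0).normal(size=(8, 8))    # hypothetical observation
r = dense_reward(obs, goal)
```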