🤖 AI Summary
This work addresses the challenge of solving vision-based reinforcement learning tasks from only a single natural language instruction. We propose a progressive goal discovery framework that generates dense rewards end-to-end, without handcrafted reward functions, dense human feedback, or non-visual state representations. It leverages a vision-language model (VLM) to identify and rank intermediate states that bring the agent closer to the goal, and trains the agent to minimise its distance to the top-ranked state in an embedding space learned from unlabelled visual data. To mitigate VLM hallucination and calibration errors, we introduce an ELO-based rating mechanism that denoises VLM feedback by aggregating many pairwise comparisons rather than trusting any single output. Evaluated on classic control and robotic manipulation benchmarks, our method achieves a mean final success rate of ~95%, substantially outperforming the best prior baseline (~45%) while significantly reducing dependence on human supervision.
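To make the ELO-based denoising step concrete, below is a minimal sketch of an Elo-style rating update driven by pairwise VLM comparisons. The K-factor, initial rating, and function names are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch: Elo-style rating update for candidate goal states.
# Each noisy VLM pairwise comparison nudges ratings rather than
# overwriting them, so repeated queries average out VLM errors.

K = 32.0              # assumed K-factor controlling update magnitude
INIT_RATING = 1000.0  # assumed initial rating for a new candidate state

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that state A beats state B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, a_wins: bool) -> tuple[float, float]:
    """Update both ratings after one (possibly noisy) VLM comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_wins else 0.0
    r_a_new = r_a + K * (s_a - e_a)
    r_b_new = r_b + K * ((1.0 - s_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Example: the VLM judges state_1 as closer to task completion.
ratings = {"state_1": INIT_RATING, "state_2": INIT_RATING}
ratings["state_1"], ratings["state_2"] = elo_update(
    ratings["state_1"], ratings["state_2"], a_wins=True
)
```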
📝 Abstract
Natural language can offer a concise and human-interpretable means of specifying reinforcement learning (RL) tasks. The ability to extract rewards from a language instruction can enable the development of robotic systems that learn from human guidance; however, it remains a challenging problem, especially in visual environments. Existing approaches that employ large, pretrained language models either rely on non-visual environment representations, require prohibitively large amounts of feedback, or generate noisy, ill-shaped reward functions. In this paper, we propose a novel method, $\textbf{GoalLadder}$, that leverages vision-language models (VLMs) to train RL agents from a single language instruction in visual environments. GoalLadder works by incrementally discovering states that bring the agent closer to completing a task specified in natural language. To do so, it queries a VLM to identify states that represent an improvement in the agent's task progress and to rank them using pairwise comparisons. Unlike prior work, GoalLadder does not trust the VLM's feedback completely; instead, it ranks potential goal states using an ELO-based rating system, thus reducing the detrimental effects of noisy VLM feedback. Over the course of training, the agent is tasked with minimising the distance to the top-ranked goal in a learned embedding space, which is trained on unlabelled visual data. This key feature allows us to bypass the need for the abundant and accurate feedback typically required to train a well-shaped reward function. We demonstrate that GoalLadder outperforms existing related methods on classic control and robotic manipulation environments, achieving an average final success rate of $\sim$95% compared to only $\sim$45% for the best competitor.
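For a concrete picture of the distance-based dense reward, here is a minimal sketch, assuming a placeholder visual encoder in place of the paper's learned embedding network; the identity encoder and all names below are illustrative assumptions.

```python
import numpy as np

def encode(observation: np.ndarray) -> np.ndarray:
    """Stand-in for the visual encoder trained on unlabelled data.

    A flattened identity map is used purely for illustration; the
    paper's actual embedding is produced by a learned network.
    """
    return observation.reshape(-1).astype(np.float32)

def dense_reward(observation: np.ndarray, goal_embedding: np.ndarray) -> float:
    """Dense reward: negative embedding distance to the top-ranked goal."""
    z = encode(observation)
    return -float(np.linalg.norm(z - goal_embedding))

# Usage: the reward grows toward zero as the agent's observation
# approaches the current top-ranked goal state in embedding space.
goal = encode(np.zeros((8, 8)))                       # hypothetical goal image
obs = np.random.default_rng(0).normal(size=(8, 8))    # hypothetical observation
r = dense_reward(obs, goal)
```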