VSC-RL: Advancing Autonomous Vision-Language Agents with Variational Subgoal-Conditioned Reinforcement Learning

📅 2025-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address low sample efficiency in vision-language sequential decision-making under sparse rewards and long-horizon dependencies, this paper proposes Variational Subgoal-Conditioned Reinforcement Learning (VSC-RL). The authors formulate the task as a variational goal-conditioned RL problem and introduce the SubGoal Evidence Lower BOund (SGC-ELBO), which jointly (a) maximizes the subgoal-conditioned return via RL and (b) minimizes the subgoal-conditioned difference from a reference policy; they prove this objective is equivalent to the original one, so learning efficiency improves without sacrificing performance guarantees. A vision-language model (VLM) autonomously decomposes the overall goal into feasible subgoals. The method achieves state-of-the-art performance across diverse benchmarks, including real-world mobile device control, with up to a 3.2× improvement in learning efficiency and an average 27.6% gain in task success rate over prior methods.

📝 Abstract
State-of-the-art (SOTA) reinforcement learning (RL) methods enable vision-language agents to learn from interactions with the environment without human supervision. However, they suffer from learning inefficiency on real-world complex sequential decision-making tasks, especially those with sparse reward signals and long-horizon dependencies. To address this issue, we introduce Variational Subgoal-Conditioned RL (VSC-RL), which reformulates the vision-language sequential decision-making task as a variational goal-conditioned RL problem, allowing us to leverage advanced optimization methods to enhance learning efficiency. Specifically, VSC-RL optimizes the SubGoal Evidence Lower BOund (SGC-ELBO), which consists of (a) maximizing the subgoal-conditioned return via RL and (b) minimizing the subgoal-conditioned difference with the reference policy. We theoretically demonstrate that SGC-ELBO is equivalent to the original optimization objective, ensuring improved learning efficiency without sacrificing performance guarantees. Additionally, for real-world complex decision-making tasks, VSC-RL leverages a vision-language model to autonomously decompose the goal into feasible subgoals, enabling efficient learning. Across various benchmarks, including challenging real-world mobile device control tasks, VSC-RL significantly outperforms SOTA vision-language agents, achieving superior performance and remarkable improvement in learning efficiency.
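The abstract describes SGC-ELBO as a trade-off between two terms: maximizing the subgoal-conditioned return and minimizing the difference from a reference policy. The toy sketch below illustrates that shape of objective for a discrete action distribution; the function name, shapes, and the KL-based difference measure are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sgc_elbo_loss(policy_probs, ref_probs, subgoal_returns, beta=0.1):
    """Toy surrogate for the SGC-ELBO trade-off: reward the policy for
    subgoal-conditioned return while penalizing divergence from a
    reference policy. Purely illustrative; names and the use of KL as
    the 'difference' term are assumptions, not the paper's method."""
    # (a) expected return under the current policy, given a fixed subgoal
    expected_return = np.sum(policy_probs * subgoal_returns)
    # (b) KL(policy || reference): penalty for drifting from the reference
    kl = np.sum(policy_probs * np.log(policy_probs / ref_probs))
    # Minimizing this loss maximizes return and minimizes the divergence
    return -expected_return + beta * kl
```

With identical policy and reference distributions the KL term vanishes, so the loss reduces to the negated expected return under that subgoal.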
Problem

Research questions and friction points this paper is trying to address.

Improve autonomous vision-language agents' decision-making efficiency
Address sparse rewards in complex sequential tasks
Decompose goals into feasible subgoals for better learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Variational Subgoal-Conditioned RL
Optimizes SubGoal Evidence Lower Bound
Autonomous goal decomposition with vision-language model