Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) struggle to map raw visual inputs to generalizable, language-conditioned action sequences, while conventional reinforcement-learning approaches suffer from reliance on dense rewards, sensitivity to hyperparameter tuning, and poor cross-environment transferability. To address these limitations, the authors propose Vision-Language Decoupled Actor-Critic (VL-DAC), an algorithm built on the PPO framework that decouples the action policy, updated token-wise, from the value function, learned at the environment-step level, enabling stable training without manual hyperparameter tuning. Training a single VLM with VL-DAC in one inexpensive simulator at a time (MiniWorld, Gym-Cards, ALFWorld, or WebShop) yields policies that generalize to real-world benchmarks: +50% relative on BALROG, +5% relative on the hardest part of VSI-Bench, and +2% on VisualWebBench, all while preserving general image-understanding accuracy. These results validate low-cost synthetic-environment training for downstream agentic, spatial-reasoning, and web-navigation tasks.

📝 Abstract
Interactive multimodal agents must convert raw visual observations into coherent sequences of language-conditioned actions -- a capability that current vision-language models (VLMs) still lack. Earlier reinforcement-learning (RL) efforts could, in principle, endow VLMs with such skills, but they have seldom tested whether the learned behaviours generalize beyond their training simulators, and they depend either on brittle hyperparameter tuning or on dense-reward environments with low state variability. We introduce Vision-Language Decoupled Actor-Critic (VL-DAC), a lightweight, hyperparameter-free RL algorithm. VL-DAC applies PPO updates to action tokens while learning value only at the environment-step level: an arrangement, to our knowledge, not previously explored for large VLMs or LLMs. This simple decoupling removes unstable weighting terms and yields faster, more reliable convergence. Training a single VLM with VL-DAC in one inexpensive simulator at a time (MiniWorld, Gym-Cards, ALFWorld, or WebShop) already produces policies that generalize widely: +50% relative on BALROG (game-centric agentic control), +5% relative on the hardest part of VSI-Bench (spatial planning), and +2% on VisualWebBench (web navigation), all without degrading general image understanding accuracy. These results provide the first evidence that a simple RL algorithm can train VLMs entirely in cheap synthetic worlds while delivering measurable gains on real-image agentic, spatial-reasoning, and web-navigation benchmarks.
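The abstract's central idea, applying PPO clipping per action token while learning value only at the environment-step level, can be sketched as a loss computation. This is a minimal illustrative sketch, not the paper's implementation: the function name, arguments, and the exact way the step-level advantage is shared across tokens are assumptions.

```python
import math

def vl_dac_losses(token_logps, old_token_logps, step_advantage,
                  step_value, step_return, clip_eps=0.2):
    """Illustrative VL-DAC-style losses (assumed form, not the paper's code).

    token_logps / old_token_logps: per-token log-probs of the action string
    under the current and behavior policies.
    step_advantage, step_value, step_return: one scalar each per
    environment step (not per token).
    """
    # Token-wise clipped PPO surrogate: every token of the action string
    # shares the single step-level advantage.
    policy_loss = 0.0
    for lp, old_lp in zip(token_logps, old_token_logps):
        ratio = math.exp(lp - old_lp)
        clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
        policy_loss += -min(ratio * step_advantage, clipped * step_advantage)
    policy_loss /= len(token_logps)

    # Step-wise value loss: one regression target per environment step,
    # decoupled from the token-level policy update.
    value_loss = (step_value - step_return) ** 2
    return policy_loss, value_loss
```

In this arrangement there is no per-token return or per-token credit-assignment weighting, which is one way the "removes unstable weighting terms" claim in the abstract could play out in practice.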
Problem

Research questions and friction points this paper is trying to address.

Improving vision-language models for real-world action sequences
Overcoming generalization issues in reinforcement learning for VLMs
Enhancing VLM training efficiency with synthetic environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

VL-DAC decouples actor-critic updates for VLMs
Trains VLMs in synthetic worlds for real-world tasks
Achieves generalization across diverse benchmarks