Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language models (VLMs) struggle with low sample efficiency, reliance on handcrafted designs, and myopic training when applied to long-horizon (100+ turns) game-playing tasks. This work proposes a variant of Proximal Policy Optimization (PPO) with a lightweight turn-level critic tailored for such settings, leveraging strong action priors from a pretrained VLM to enable efficient policy optimization without end-to-end retraining. Evaluated on multiple levels of Super Mario Land, the method achieves at least three times the average game progress of frontier models, substantially improving sample efficiency and reducing the need for manual design choices such as action engineering. It also demonstrates strong in-game and cross-game generalization while preserving the VLM's performance on general-domain tasks.

📝 Abstract

Given the rapidly growing capabilities of vision-language models (VLMs), extending them to interactive decision-making tasks such as video games has emerged as a promising frontier. However, existing approaches either rely on large-scale supervised fine-tuning (SFT) on human trajectories or apply reinforcement learning (RL) only in relatively short-horizon settings (typically around 20-30 turns). In this work, we study RL-based training of VLMs for long-horizon decision-making in Super Mario Land, a visually grounded environment requiring 100+ turns of interaction with coordinated perception, reasoning, and action. We begin with a systematic investigation of key algorithmic components and propose an adapted variant of PPO with a lightweight turn-level critic, which substantially improves training stability and sample efficiency over critic-free methods such as GRPO and Reinforce++. We further show that pretrained VLMs provide strong action priors, significantly improving sample efficiency during RL training and reducing the need for manual design choices such as action engineering, compared to classical deep RL trained from scratch. Building on these insights, we introduce Odysseus, an open training framework for VLM agents, which achieves substantial gains across multiple levels of the game and at least 3 times the average game progress of frontier models. Moreover, the trained models exhibit consistent improvements under both in-game and cross-game generalization settings, while maintaining general-domain capabilities. Overall, our results identify key ingredients for making RL stable and effective in long-horizon, multi-modal settings, and provide practical guidance for developing VLMs as embodied agents.
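
To make the idea of a turn-level critic concrete, the following is a minimal sketch, not the paper's implementation. It assumes the critic outputs one value estimate per turn (rather than per token), that advantages are computed with standard generalized advantage estimation (GAE) over turns, and that each turn's advantage is then shared by every token the VLM generated in that turn. The function names, the discount `gamma`, and the GAE parameter `lam` are illustrative assumptions; the abstract does not specify the estimator or its hyperparameters.

```python
from typing import List


def turn_level_gae(
    rewards: List[float],
    values: List[float],
    gamma: float = 0.99,  # assumed discount factor, not from the paper
    lam: float = 0.95,    # assumed GAE parameter, not from the paper
) -> List[float]:
    """Standard GAE computed over turns instead of tokens.

    rewards[t] is the reward received after turn t, and values[t] is the
    (hypothetical) critic's estimate at the start of turn t. The episode is
    assumed to terminate after the last turn, so the bootstrap value is 0.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    advantages = [0.0] * len(rewards)
    next_value = 0.0  # terminal bootstrap
    running = 0.0     # discounted sum of TD residuals
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def broadcast_to_tokens(
    advantages: List[float], tokens_per_turn: List[int]
) -> List[List[float]]:
    """Give every token generated in a turn that turn's advantage."""
    return [[a] * n for a, n in zip(advantages, tokens_per_turn)]


if __name__ == "__main__":
    # Three-turn episode with a sparse reward on the final turn.
    adv = turn_level_gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0], gamma=0.5, lam=1.0)
    print(adv)
    print(broadcast_to_tokens(adv, [2, 1, 3]))
```

Under these assumptions the critic only needs one scalar per turn, which is one plausible reading of "lightweight". Critic-free methods such as GRPO instead derive advantages from returns normalized across a group of sampled trajectories.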

Problem

Research questions and friction points this paper is trying to address.

vision-language models

long-horizon decision-making

reinforcement learning

interactive games

embodied agents

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models

Reinforcement Learning

Long-Horizon Decision-Making