G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Despite their strong multimodal understanding capabilities, vision-language models (VLMs) exhibit weak decision-making performance in interactive visual environments—e.g., games—revealing a significant “knowing-but-not-doing” gap. To address this, we introduce VLM-Gym: a composable, difficulty-controllable reinforcement learning benchmark comprising multiple visually rich games. We further propose G1, a novel training framework featuring perception-augmented cold-start initialization and multi-task parallel RL fine-tuning. G1 is the first to empirically uncover the synergistic emergence and mutual guidance between perceptual and reasoning capabilities during training. It enables cross-game generalization and capability self-bootstrapping, achieving state-of-the-art performance across diverse visual games—surpassing both teacher models and Claude-3.7-Sonnet-Thinking. All code, environments, and models are publicly released.

📝 Abstract
Vision-Language Models (VLMs) excel in many direct multimodal tasks but struggle to translate this prowess into effective decision-making within interactive, visually rich environments like games. This ``knowing-doing'' gap significantly limits their potential as autonomous agents, as leading VLMs often perform poorly even in simple games. To address this, we introduce VLM-Gym, a curated reinforcement learning (RL) environment featuring diverse visual games with unified interfaces and adjustable, compositional difficulty, specifically designed for scalable multi-game parallel training. Leveraging VLM-Gym, we train G0 models using pure RL-driven self-evolution, which demonstrate emergent perception and reasoning patterns. To further mitigate challenges arising from game diversity, we develop G1 models. G1 incorporates a perception-enhanced cold start prior to RL fine-tuning. Our resulting G1 models consistently surpass their teacher across all games and outperform leading proprietary models like Claude-3.7-Sonnet-Thinking. Systematic analysis reveals an intriguing finding: perception and reasoning abilities mutually bootstrap each other throughout the RL training process. Source code, including VLM-Gym and the RL training pipeline, is released at https://github.com/chenllliang/G1 to foster future research in advancing VLMs as capable interactive agents.
Problem

Research questions and friction points this paper is trying to address.

VLMs struggle in decision-making within interactive visual environments
Addressing the 'knowing-doing' gap in Vision-Language Models via RL
Enhancing perception and reasoning in VLMs through self-evolution
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLM-Gym enables scalable multi-game RL training
G1 uses perception-enhanced cold start before RL
Mutual bootstrapping of perception and reasoning abilities
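The innovations above center on a unified, difficulty-controllable game interface that supports multi-game parallel RL. The paper does not specify its API, but a minimal sketch of what such an interface and a parallel rollout loop might look like (all class and function names here are hypothetical, not from the G1 codebase) is:

```python
import random


class VisualGameEnv:
    """Hypothetical sketch of a VLM-Gym-style environment: a unified
    reset/step interface with a composable difficulty knob."""

    def __init__(self, name, difficulty=1):
        self.name = name
        self.difficulty = difficulty  # higher values = harder games
        self.steps = 0

    def reset(self):
        self.steps = 0
        return {"game": self.name, "observation": "frame_0"}

    def step(self, action):
        self.steps += 1
        # Toy reward model: successful actions become rarer as difficulty rises.
        reward = 1.0 if random.random() < 1.0 / self.difficulty else 0.0
        done = self.steps >= 4  # fixed-length toy episodes
        obs = {"game": self.name, "observation": f"frame_{self.steps}"}
        return obs, reward, done


def parallel_rollout(envs, policy):
    """Collect one episode per game; a multi-task RL trainer would mix
    these trajectories into a single fine-tuning batch."""
    returns = {}
    for env in envs:
        obs, total, done = env.reset(), 0.0, False
        while not done:
            obs, reward, done = env.step(policy(obs))
            total += reward
        returns[env.name] = total
    return returns


envs = [VisualGameEnv("2048", difficulty=1), VisualGameEnv("sokoban", difficulty=3)]
returns = parallel_rollout(envs, policy=lambda obs: "noop")
```

The key design point this sketch illustrates is that a shared `reset`/`step` contract across visually distinct games is what makes scalable multi-game parallel training tractable.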
Liang Chen
Peking University, Moonshot AI
Hongcheng Gao
University of Chinese Academy of Sciences
Natural Language Processing · Large Language Models · Vision Language Models
Tianyu Liu
Peking University
Zhiqi Huang
Moonshot AI
Flood Sung
Moonshot AI
Foundation Models · LLM/VLM · Agent · Reinforcement Learning · Meta Learning
Xinyu Zhou
Moonshot AI
Yuxin Wu
Moonshot AI
Baobao Chang
Peking University