Reinforcement Learning in Vision: A Survey

📅 2025-08-11
🤖 AI Summary
This survey systematically reviews the state of visual reinforcement learning (Visual RL), addressing the core challenge of tightly integrating visual perception with sequential decision-making. We formalize the Visual RL problem and trace the methodological evolution from RLHF to verifiable reward modeling, highlighting emerging paradigms such as curriculum learning, preference-aligned diffusion, and unified reward modeling. Methodologically, we establish a technical framework grounded in multimodal large models, visual generation, unified architectures, and vision-language-action models, integrating policy-optimization techniques (e.g., PPO, Group Relative Policy Optimization) with multimodal alignment and diffusion-based modeling. Our contributions include a comprehensive analysis of 200+ representative works, the first fine-grained taxonomy for Visual RL, and an open-source benchmarking repository. We identify sample efficiency, cross-task generalization, and safe deployment as critical open challenges for future research.
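The summary's move from PPO to Group Relative Policy Optimization (GRPO) can be made concrete: GRPO drops PPO's learned value baseline and instead scores each sampled response against the statistics of its own sampling group. A minimal sketch of that group-relative advantage computation, with illustrative function and variable names that are not from the paper:

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantage: normalize each sampled response's reward
    by the mean and std of its own sampling group, replacing PPO's learned
    value-function baseline with group statistics."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Four responses sampled for one prompt and scored by a reward function;
# above-average responses get positive advantage and are reinforced.
print(grpo_advantages([0.2, 0.9, 0.4, 0.5]))
```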

📝 Abstract
Recent advances at the intersection of reinforcement learning (RL) and visual intelligence have enabled agents that not only perceive complex visual scenes but also reason, generate, and act within them. This survey offers a critical and up-to-date synthesis of the field. We first formalize visual RL problems and trace the evolution of policy-optimization strategies from RLHF to verifiable reward paradigms, and from Proximal Policy Optimization to Group Relative Policy Optimization. We then organize more than 200 representative works into four thematic pillars: multi-modal large language models, visual generation, unified model frameworks, and vision-language-action models. For each pillar, we examine algorithmic design, reward engineering, and benchmark progress, and we distill trends such as curriculum-driven training, preference-aligned diffusion, and unified reward modeling. Finally, we review evaluation protocols spanning set-level fidelity, sample-level preference, and state-level stability, and we identify open challenges that include sample efficiency, generalization, and safe deployment. Our goal is to provide researchers and practitioners with a coherent map of the rapidly expanding landscape of visual RL and to highlight promising directions for future inquiry. Resources are available at: https://github.com/weijiawu/Awesome-Visual-Reinforcement-Learning.
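One of the distilled trends, preference-aligned diffusion, adapts DPO-style objectives to denoising models: the policy is rewarded for lowering its denoising error on a preferred image relative to a frozen reference model, and penalized for doing so on the dispreferred one. A minimal sketch in that spirit; the function name, β scale, and squared-error inputs are illustrative assumptions, not the survey's formulation:

```python
import torch
import torch.nn.functional as F

def diffusion_preference_loss(err_w_theta, err_w_ref,
                              err_l_theta, err_l_ref, beta=1.0):
    """DPO-style loss on per-sample denoising errors (MSE between true
    and predicted noise). Lowering the policy's error on the preferred
    sample, relative to the reference model, decreases the loss;
    lowering it on the dispreferred sample increases it."""
    margin = (err_w_ref - err_w_theta) - (err_l_ref - err_l_theta)
    return -F.logsigmoid(beta * margin).mean()

# Toy usage with a batch of 8 scalar errors per term.
e = [torch.rand(8) for _ in range(4)]
print(diffusion_preference_loss(*e))
```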
Problem

Research questions and friction points this paper is trying to address.

Surveying advances at the intersection of reinforcement learning and visual intelligence
Organizing 200+ representative works into a four-pillar thematic analysis
Identifying evaluation protocols and open challenges in visual RL
Innovation

Methods, ideas, or system contributions that make the work stand out.

Policy optimization traced from RLHF to verifiable rewards (see the sketch after this list)
Unified model frameworks and vision-language-action models
Curriculum-driven training and preference-aligned diffusion methods
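Where RLHF scores outputs with a learned and therefore gameable preference model, a verifiable reward is a deterministic check against ground truth. A minimal sketch of such a rule-based reward; the "Answer:" tag convention and function name are hypothetical, not taken from the survey:

```python
def verifiable_reward(response: str, ground_truth: str) -> float:
    """Rule-based reward: 1.0 iff the response's final answer exactly
    matches the ground truth, else 0.0. Unlike a learned RLHF reward
    model, this check is deterministic and hard to reward-hack."""
    # Hypothetical convention: the final answer follows an "Answer:" tag.
    answer = response.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0

print(verifiable_reward("Six times seven... Answer: 42", "42"))  # 1.0
```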
👥 Authors

Weijia Wu
National University of Singapore; Zhejiang University
Video Generation, LLM, AIGC

Chen Gao
Show Lab, National University of Singapore

Joya Chen
National University of Singapore
AI

Kevin Qinghong Lin
University of Oxford; National University of Singapore
Vision and Language, Video Understanding, AI Agent

Qingwei Meng
Zhejiang University

Yiming Zhang
The Chinese University of Hong Kong

Yuke Qiu
Zhejiang University

Hong Zhou
Zhejiang University

Mike Zheng Shou
Show Lab, National University of Singapore