MVR: Multi-view Video Reward Shaping for Reinforcement Learning

πŸ“… 2026-03-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing vision-language model–based reward augmentation methods rely on single-view static images, making them susceptible to pose bias, and their linear reward composition can distort the optimal policy, limiting performance in complex dynamic tasks. To address these limitations, this work proposes MVR, a multi-view video reward shaping framework that learns state relevance through similarity between multi-view videos and textual descriptions, thereby mitigating occlusion and static-view bias inherent in single-view approaches. Furthermore, MVR introduces a state-dependent adaptive reward shaping mechanism that automatically attenuates the weight of vision-language guidance once the target action is achieved. Experiments on HumanoidBench and MetaWorld demonstrate the effectiveness of MVR, and ablation studies confirm the necessity of its key design components.


πŸ“ Abstract
Reward design is of great importance for solving complex tasks with reinforcement learning. Recent studies have explored using image-text similarity produced by vision-language models (VLMs) to augment rewards of a task with visual feedback. A common practice linearly adds VLM scores to task or success rewards without explicit shaping, potentially altering the optimal policy. Moreover, such approaches, often relying on single static images, struggle with tasks whose desired behavior involves complex, dynamic motions spanning multiple visually different states. Furthermore, single viewpoints can occlude critical aspects of an agent's behavior. To address these issues, this paper presents Multi-View Video Reward Shaping (MVR), a framework that models the relevance of states regarding the target task using videos captured from multiple viewpoints. MVR leverages video-text similarity from a frozen pre-trained VLM to learn a state relevance function that mitigates the bias towards specific static poses inherent in image-based methods. Additionally, we introduce a state-dependent reward shaping formulation that integrates task-specific rewards and VLM-based guidance, automatically reducing the influence of VLM guidance once the desired motion pattern is achieved. We confirm the efficacy of the proposed framework with extensive experiments on challenging humanoid locomotion tasks from HumanoidBench and manipulation tasks from MetaWorld, verifying the design choices through ablation studies.
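The shaping scheme described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual formulation: the function names (`vlm_relevance`, `shaped_reward`), the averaging over viewpoints, and the `(1 - relevance)` attenuation weight are all assumptions made for clarity. In MVR the relevance would come from a frozen pre-trained VLM comparing multi-view video clips against a textual task description; here a placeholder stands in for that score.

```python
import numpy as np

def vlm_relevance(view_scores):
    """Stand-in for a frozen VLM's video-text similarity in [0, 1].

    In MVR this would be computed from videos captured from multiple
    viewpoints, mitigating occlusion and static-pose bias; here we
    simply average per-view scores (hypothetical aggregation).
    """
    return float(np.clip(np.mean(view_scores), 0.0, 1.0))

def shaped_reward(task_reward, relevance):
    """State-dependent shaping (illustrative only).

    The VLM guidance term is weighted by (1 - relevance), so its
    influence fades automatically as the desired motion pattern is
    achieved (relevance -> 1), leaving the task reward dominant.
    """
    weight = 1.0 - relevance
    return task_reward + weight * relevance
```

With this toy weighting, guidance is strongest for partially relevant states and vanishes entirely once relevance saturates, which matches the abstract's claim that shaping should not alter the optimal policy at convergence.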
Problem

Research questions and friction points this paper is trying to address.

reward shaping
vision-language models
multi-view video
reinforcement learning
state relevance
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-view video
reward shaping
vision-language models
reinforcement learning
state relevance function
Lirui Luo
School of Intelligence Science and Technology, Peking University; State Key Laboratory of General Artificial Intelligence, BIGAI
Guoxi Zhang
State Key Laboratory of General Artificial Intelligence, BIGAI
Hongming Xu
State Key Laboratory of General Artificial Intelligence, BIGAI
Yaodong Yang
Boya (εšι›…) Assistant Professor at Peking University
Reinforcement Learning Β· AI Alignment Β· Embodied AI
Cong Fang
Peking University
machine learning Β· optimization Β· statistics
Qing Li
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Machine Learning Β· Large Language Model Β· Ordinary/Partial Differential Equation