Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning

📅 2025-11-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the poor generalizability and high task-specific engineering cost of hand-crafted reward functions in reinforcement learning. We propose a video-driven reward generation framework that eliminates manual reward design. Its core innovation is the first integration of pretrained video diffusion models into RL reward modeling: leveraging the world dynamics these models implicitly learn, the framework generates both video-level and frame-level goal-directed reward signals. To enhance semantic relevance, we employ CLIP-based filtering to identify key frames; additionally, forward-backward representation learning is introduced to improve the temporal coherence of the learned policy. Evaluated on the Meta-World multi-task benchmark, our method significantly improves agent performance on complex visual-goal tasks while fully decoupling learning from task-specific reward engineering. Results demonstrate superior generalization across diverse manipulation tasks without requiring any hand-designed reward function.

📝 Abstract
Reinforcement Learning (RL) has achieved remarkable success in various domains, yet it often relies on carefully designed programmatic reward functions to guide agent behavior. Designing such reward functions can be challenging, and they may not generalize well across different tasks. To address this limitation, we leverage the rich world knowledge contained in pretrained video diffusion models to provide goal-driven reward signals for RL agents without ad-hoc reward design. Our key idea is to exploit off-the-shelf video diffusion models pretrained on large-scale video datasets as informative reward functions in terms of video-level and frame-level goals. For video-level rewards, we first finetune a pretrained video diffusion model on domain-specific datasets and then employ its video encoder to evaluate the alignment between the latent representations of the agent's trajectories and the generated goal videos. To enable more fine-grained goal achievement, we derive a frame-level goal by identifying the most relevant frame from the generated video using CLIP, which serves as the goal state. We then employ a learned forward-backward representation that captures the probability of visiting the goal state from a given state-action pair as the frame-level reward, promoting more coherent and goal-driven trajectories. Experiments on various Meta-World tasks demonstrate the effectiveness of our approach.
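The video-level reward described above can be thought of as a latent-space alignment score between the agent's rollout and a generated goal video. A minimal sketch, assuming cosine similarity as the alignment measure and using a mean-pooling stand-in for the finetuned diffusion model's video encoder (the stand-in encoder and toy frame shapes are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def encode_video(frames: np.ndarray) -> np.ndarray:
    """Stand-in video encoder: flatten each frame, then mean-pool over time.
    In the paper's pipeline this would be the finetuned video diffusion
    model's video encoder producing a trajectory-level latent."""
    return frames.reshape(frames.shape[0], -1).mean(axis=0)

def video_level_reward(agent_frames: np.ndarray, goal_frames: np.ndarray) -> float:
    """Cosine similarity between the latent of the agent's trajectory and
    the latent of the generated goal video (one plausible alignment score)."""
    z_a = encode_video(agent_frames)
    z_g = encode_video(goal_frames)
    return float(z_a @ z_g / (np.linalg.norm(z_a) * np.linalg.norm(z_g) + 1e-8))

# Toy rollouts: 8 frames of 16x16 "grayscale" pixels.
rng = np.random.default_rng(0)
goal = rng.random((8, 16, 16))
same = video_level_reward(goal, goal)                  # matching trajectory
diff = video_level_reward(rng.random((8, 16, 16)), goal)  # unrelated trajectory
```

A trajectory identical to the goal video scores near 1, while an unrelated one scores lower, which is the property the dense video-level reward relies on.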
Problem

Research questions and friction points this paper is trying to address.

Designing programmatic reward functions for RL is challenging, and hand-crafted rewards generalize poorly across tasks.
We use pretrained video diffusion models to provide goal-driven rewards without manual reward design.
Our method evaluates video-level and frame-level alignment to guide agent behavior effectively.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video diffusion models provide goal-driven reward signals
Frame-level goals identified using CLIP for fine-grained control
Learned forward-backward representation estimates goal state probability
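The last two bullets can be sketched together: CLIP-based key-frame identification reduces to an argmax over image-text similarities, and the forward-backward (FB) reward is an inner product between a forward embedding of the current state-action pair and a backward embedding of the goal frame. The embeddings below are illustrative stand-ins, and `F_sa`/`B_goal` are assumed names for the learned FB representations, not the paper's API:

```python
import numpy as np

def select_goal_frame(frame_embs: np.ndarray, text_emb: np.ndarray) -> int:
    """Return the index of the generated frame whose embedding best matches
    the task description, mimicking CLIP-based key-frame filtering.
    frame_embs: (num_frames, d) image embeddings; text_emb: (d,) text embedding."""
    sims = frame_embs @ text_emb / (
        np.linalg.norm(frame_embs, axis=1) * np.linalg.norm(text_emb) + 1e-8
    )
    return int(np.argmax(sims))

def frame_level_reward(F_sa: np.ndarray, B_goal: np.ndarray) -> float:
    """FB-style reward: the inner product F(s,a) . B(g) scores how likely
    the goal state g is to be visited from the state-action pair (s,a)."""
    return float(F_sa @ B_goal)

# Toy example: three 3-d "CLIP" embeddings; frame 1 aligns with the text.
frames = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.5, 0.0]])
text = np.array([1.0, 0.0, 0.0])
goal_idx = select_goal_frame(frames, text)
```

The selected frame then supplies `B_goal`, so the frame-level reward stays dense even when the video-level score changes slowly.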
Qi Wang
Shanghai Jiao Tong University
Mian Wu
Shanghai Jiao Tong University
Yuyang Zhang
Graduate Student, Harvard University
Reinforcement Learning, Control Theory
Mingqi Yuan
PhD candidate at HKPU
Machine Learning
Wenyao Zhang
PhD Student, Shanghai Jiao Tong University
Robot Learning, Representation Learning
Haoxiang You
PhD Student, Yale University
Robotics, Reinforcement Learning, Machine Learning, Control Theory, Optimization
Yunbo Wang
Shanghai Jiao Tong University
Xin Jin
Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo
Xiaokang Yang
Shanghai Jiao Tong University
Wenjun Zeng
Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo