🤖 AI Summary
Offline reinforcement learning (RL) suffers from suboptimal policies and inaccurate value estimation because the agent cannot interact with the environment. To address this, we propose an interactive world model framework grounded in natural videos. It uses large-scale, unlabeled online video data as a source of prior knowledge, without requiring annotations or domain alignment, and thereby transfers control commonsense and physical dynamics across domains to the target task. Our method integrates generative video modeling, implicit dynamics learning, model-guided policy distillation, and offline policy optimization. Evaluated on visuomotor control tasks in robotic manipulation, autonomous driving, and open-world video games, it achieves substantial performance gains over state-of-the-art offline RL methods, exceeding 100% in some cases. Our core contributions are: (i) a video-driven paradigm for world model construction, and (ii) an end-to-end transfer pathway from natural video to embodied control policies.
📝 Abstract
Offline reinforcement learning (RL) enables policy optimization from static datasets, avoiding the risks and costs of real-world exploration. However, it struggles with suboptimal behavior learning and inaccurate value estimation due to the lack of environmental interaction. In this paper, we present Video-Enhanced Offline RL (VeoRL), a model-based approach that constructs an interactive world model from diverse, unlabeled video data readily available online. Leveraging model-based behavior guidance, VeoRL transfers commonsense knowledge of control policies and physical dynamics from natural videos to the RL agent within the target domain. Our method achieves substantial performance gains (exceeding 100% in some cases) across visuomotor control tasks in robotic manipulation, autonomous driving, and open-world video games.