Learning from Reward-Free Offline Data: A Case for Planning with Latent Dynamics Models

πŸ“… 2025-02-20
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses policy learning from reward-free offline trajectory data, with the aim of generalization and zero-shot transfer across environments and tasks. The paper proposes planning with a latent dynamics model trained via the Joint Embedding Predictive Architecture (JEPA) and used inside model predictive control (MPC), and contrasts this with model-free offline reinforcement learning. A systematic comparison shows that the planning-centric paradigm outperforms state-of-the-art model-free RL along three key dimensions: zero-shot transfer, trajectory stitching, and data efficiency. The method generalizes zero-shot across diverse environment layouts, enables cross-task planning from limited data, and is robust to suboptimal trajectories. The core contribution is the tight integration of JEPA representation learning, latent-space dynamics modeling, and goal-conditioned MPC, enabling efficient, robust, and transferable policy learning without reward supervision.
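To make the latent-dynamics idea concrete, here is a minimal numpy sketch, not the paper's implementation: a caller-supplied encoder stands in for the JEPA-trained embedding, and a linear latent transition model is fit on consecutive latent pairs with ridge regression. The function name and interface are illustrative assumptions.

```python
import numpy as np

def fit_latent_dynamics(obs, actions, encoder, reg=1e-3):
    """Fit a linear latent transition model z_{t+1} ~= W @ [z_t; a_t; 1].

    `obs` has shape (T, obs_dim) and `actions` has shape (T-1, action_dim).
    `encoder` stands in for a JEPA-trained embedding; any callable mapping an
    observation to a latent vector works. Ridge regularization keeps the
    closed-form fit stable on small offline datasets.
    """
    Z = np.stack([encoder(o) for o in obs])           # latents, (T, latent_dim)
    X = np.hstack([Z[:-1], actions,
                   np.ones((len(actions), 1))])       # inputs [z_t, a_t, 1]
    Y = Z[1:]                                         # targets z_{t+1}
    # Closed-form ridge solution: W = (X^T X + reg * I)^{-1} X^T Y
    W = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ Y)

    def dynamics(z, a):
        """Predict the next latent state from (z, a)."""
        return np.concatenate([z, a, [1.0]]) @ W

    return dynamics
```

A linear model is only a stand-in for the learned predictor; the point is the interface: a transition function in latent space that a planner can roll forward.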

πŸ“ Abstract
A long-standing goal in AI is to build agents that can solve a variety of tasks across different environments, including previously unseen ones. Two dominant approaches tackle this challenge: (i) reinforcement learning (RL), which learns policies through trial and error, and (ii) optimal control, which plans actions using a learned or known dynamics model. However, their relative strengths and weaknesses remain underexplored in the setting where agents must learn from offline trajectories without reward annotations. In this work, we systematically analyze the performance of different RL and control-based methods under datasets of varying quality. On the RL side, we consider goal-conditioned and zero-shot approaches. On the control side, we train a latent dynamics model using the Joint Embedding Predictive Architecture (JEPA) and use it for planning. We study how dataset properties, such as data diversity, trajectory quality, and environment variability, affect the performance of these approaches. Our results show that model-free RL excels when abundant, high-quality data is available, while model-based planning excels in generalization to novel environment layouts, trajectory stitching, and data efficiency. Notably, planning with a latent dynamics model emerges as a promising approach for zero-shot generalization from suboptimal data.
Problem

Research questions and friction points this paper is trying to address.

Compare RL and control methods for offline learning.
Analyze dataset quality impact on method performance.
Evaluate zero-shot generalization using latent dynamics models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent dynamics model for planning
Joint Embedding Predictive Architecture
Zero-shot generalization from data
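The planning side, goal-conditioned MPC in latent space, can be sketched by random shooting: sample candidate action sequences, roll each through the learned latent dynamics, and execute the first action of the sequence ending closest to the goal embedding. This is a generic sketch assuming a black-box `dynamics(z, a)` callable, not the paper's planner.

```python
import numpy as np

def plan_with_latent_mpc(z0, z_goal, dynamics, action_dim,
                         horizon=10, n_samples=256, action_scale=1.0,
                         rng=None):
    """Goal-conditioned MPC via random shooting in latent space.

    `dynamics(z, a)` is the learned latent transition model, treated as a
    black box. Returns the first action of the best sampled sequence and
    that sequence's final latent distance to the goal.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Candidate action sequences: (n_samples, horizon, action_dim).
    actions = rng.uniform(-action_scale, action_scale,
                          size=(n_samples, horizon, action_dim))
    costs = np.zeros(n_samples)
    for i in range(n_samples):
        z = z0
        for t in range(horizon):
            z = dynamics(z, actions[i, t])      # roll out in latent space
        costs[i] = np.linalg.norm(z - z_goal)   # distance to goal embedding
    best = np.argmin(costs)
    return actions[best, 0], costs[best]
```

In a receding-horizon loop, the agent would execute the returned action, re-encode the new observation, and replan; more sample-efficient optimizers such as the cross-entropy method are a common drop-in for the uniform sampling used here.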
πŸ”Ž Similar Papers
No similar papers found.