Model Predictive Control with Differentiable World Models for Offline Reinforcement Learning

📅 2026-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes an inference-time adaptation framework for offline reinforcement learning that integrates a differentiable world model with model predictive control (MPC), enabling end-to-end gradient-based optimization of pretrained policy parameters through imagined trajectories. Unlike conventional offline RL approaches that deploy fixed policies incapable of leveraging environmental dynamics during inference, the proposed method overcomes the static deployment limitation by adaptively refining policies at test time. Evaluated on the D4RL benchmark—including MuJoCo locomotion and AntMaze tasks—the approach significantly outperforms strong existing baselines, demonstrating both the effectiveness and practicality of online policy optimization during inference.

📝 Abstract
Offline Reinforcement Learning (RL) aims to learn optimal policies from fixed offline datasets, without further interactions with the environment. Such methods train an offline policy (or value function) and apply it at inference time without further refinement. We introduce an inference-time adaptation framework inspired by model predictive control (MPC) that utilizes a pretrained policy along with a learned world model of state transitions and rewards. While existing world-model and diffusion-planning methods use learned dynamics to generate imagined trajectories during training, or to sample candidate plans at inference time, they do not use inference-time information to optimize the policy parameters on the fly. In contrast, our design is a Differentiable World Model (DWM) pipeline that enables end-to-end gradient computation through imagined rollouts for policy optimization at inference time based on MPC. We evaluate our algorithm on D4RL continuous-control benchmarks (MuJoCo locomotion tasks and AntMaze), and show that exploiting inference-time information to optimize the policy parameters yields consistent gains over strong offline RL baselines.
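The core loop described in the abstract — roll a pretrained policy through a differentiable world model, differentiate the imagined return with respect to the policy parameters, and take gradient steps at inference time — can be sketched in miniature. Everything below is a toy stand-in, not the paper's implementation: 1-D linear dynamics s' = s + a, reward -s'², a one-parameter linear policy a = θ·s, and a hand-derived reverse-mode sweep in place of an autodiff framework (the paper learns neural dynamics and reward models from offline data).

```python
HORIZON = 5

def imagined_rollout(theta, s0, horizon=HORIZON):
    """Roll the policy through the (differentiable) world model, returning the
    imagined return J and dJ/dtheta via a manual reverse-mode sweep."""
    states = [s0]
    s = s0
    for _ in range(horizon):
        s = s + theta * s                    # world-model step with policy action
        states.append(s)
    J = -sum(st * st for st in states[1:])   # imagined return (reward = -s'^2)

    # Reverse sweep: accumulate dJ/dtheta through the whole rollout.
    g, dtheta = 0.0, 0.0                     # g tracks dJ/ds_{t+1}
    for t in range(horizon - 1, -1, -1):
        g += -2.0 * states[t + 1]            # reward gradient at this step
        dtheta += g * states[t]              # direct effect: ds_{t+1}/dtheta = s_t
        g *= 1.0 + theta                     # chain rule: ds_{t+1}/ds_t = 1 + theta
    return J, dtheta

def adapt_at_inference(theta, s0, steps=100, lr=0.01):
    """MPC-style inference-time adaptation: gradient-ascend the policy
    parameter on the imagined return before committing to an action."""
    for _ in range(steps):
        _, dtheta = imagined_rollout(theta, s0)
        theta += lr * dtheta
    return theta

theta0, s0 = 0.0, 2.0                        # untuned policy, current observed state
theta_star = adapt_at_inference(theta0, s0)
J_before, _ = imagined_rollout(theta0, s0)
J_after, _ = imagined_rollout(theta_star, s0)
# The adapted policy steers s toward 0 (the optimum here is theta = -1).
```

In an MPC deployment this inner optimization would run at every environment step: adapt, execute the first action of the improved policy, observe the next state, and replan. The toy's hand-coded backward pass is exactly what an autodiff framework computes automatically through a learned neural world model.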
Problem

Research questions and friction points this paper is trying to address.

Offline Reinforcement Learning
Model Predictive Control
Differentiable World Model
Inference-time Adaptation
Policy Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differentiable World Model
Model Predictive Control
Offline Reinforcement Learning
Inference-time Adaptation
End-to-end Gradient Optimization
Rohan Deb
Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign, IL, USA
Stephen J. Wright
Department of Computer Sciences, University of Wisconsin-Madison, WI, USA
Arindam Banerjee
Founder Professor, Dept of Computer Science, University of Illinois Urbana-Champaign
Machine Learning · Data Mining