On Training in Imagination

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates how errors in learned dynamics and reward models affect policy performance when policies are trained on imagined rollouts. The authors extend the MDP error analysis of Asadi et al. (2018) to settings with learned reward models and, under power-law scaling assumptions, derive the ratio of dynamics samples to reward samples that minimizes a bound on return error. They identify low Lipschitz constants of the learned dynamics, reward, and policy as a representation property that tightens this bound. For policy optimization with REINFORCE, the analysis shows that zero-mean reward noise leaves the gradient estimator unbiased and adds at most a variance term that decreases with the number of imagined rollouts. Under a fixed budget, the resulting choice between more rollouts with cheaper, noisier rewards and fewer rollouts with costlier, cleaner rewards is reduced to a one-dimensional optimization problem.
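The claim about zero-mean reward noise can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions, not the paper's setup: it assumes a one-step bandit with a softmax policy, Gaussian reward noise, and hypothetical helper names (`true_gradient`, `reinforce_estimate`). It compares the noisy REINFORCE estimate against the exact policy gradient.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def true_gradient(logits, rewards):
    """Exact gradient of expected reward w.r.t. the logits of a softmax policy.

    For a softmax bandit policy, d E[r] / d theta_k = pi_k * (r_k - E[r]).
    """
    pi = softmax(logits)
    expected = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - expected) for p, r in zip(pi, rewards)]


def reinforce_estimate(logits, rewards, n_rollouts, noise_std, rng):
    """REINFORCE estimate averaged over n_rollouts one-step rollouts.

    Each observed reward is the true reward plus zero-mean Gaussian noise
    (an assumption made for this illustration).
    """
    pi = softmax(logits)
    actions = list(range(len(pi)))
    grad = [0.0] * len(logits)
    for _ in range(n_rollouts):
        a = rng.choices(actions, weights=pi)[0]
        noisy_r = rewards[a] + rng.gauss(0.0, noise_std)
        for k in actions:
            # Score function of a softmax policy: 1[k == a] - pi_k.
            score = (1.0 if k == a else 0.0) - pi[k]
            grad[k] += noisy_r * score
    return [g / n_rollouts for g in grad]


if __name__ == "__main__":
    rng = random.Random(0)
    logits, rewards = [0.2, -0.1], [1.0, 0.0]
    print("exact   :", true_gradient(logits, rewards))
    for n in (10, 1000, 100000):
        print(f"n={n:<6d}:", reinforce_estimate(logits, rewards, n, 1.0, rng))
```

With more rollouts the noisy estimate should concentrate around the exact gradient, which is the behavior the summary describes: the noise contributes variance that averaging reduces, but no systematic offset.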

📝 Abstract

State-of-the-art model-based reinforcement learning methods train policies on imagined rollouts. These rollouts are trajectories generated by a learned dynamics model and scored by a learned reward model, without querying the true environment during policy updates. We study this training paradigm by quantifying how errors in learned dynamics and reward models affect returns and policy optimization. First, we extend the analysis of Asadi et al. (2018) to MDPs with learned reward models, and derive the optimal sample allocation: the ratio of dynamics samples to reward samples that minimizes a bound on return error under power-law scaling assumptions. We identify lower Lipschitz constants of the learned dynamics, reward, and policy as a representation desideratum that tightens this bound, and we connect this perspective to the temporal-straightening objective of Wang et al. (2026). Second, we examine how policy optimization with REINFORCE tolerates noisy rewards, which are often cheaper to obtain. We show that zero-mean reward noise leaves the gradient estimator unbiased and adds at most a variance term that decreases with the number of rollouts. This introduces a practical tradeoff: given a fixed budget, should one buy more rollouts with cheaper but noisier rewards, or fewer rollouts with more expensive but less noisy rewards? We reduce this choice to a one-dimensional optimization problem and characterize the optimum.
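To make the budget tradeoff concrete, here is a minimal sketch of what such a one-dimensional problem could look like. Everything specific in it is an assumption for illustration and is not taken from the paper: the cost model (one rollout costs `base_cost + precision_cost / sigma2`, so cleaner rewards cost more), the variance proxy `(base_var + sigma2) / n_rollouts`, and the function names are all hypothetical.

```python
import math


def gradient_variance(sigma2, budget, base_var, base_cost, precision_cost):
    """Variance proxy of the averaged estimator at reward-noise variance sigma2.

    Hypothetical cost model: a rollout costs base_cost + precision_cost / sigma2,
    so the budget buys budget / cost_per_rollout rollouts.
    """
    cost_per_rollout = base_cost + precision_cost / sigma2
    n_rollouts = budget / cost_per_rollout
    return (base_var + sigma2) / n_rollouts


def best_noise_level(budget, base_var, base_cost, precision_cost,
                     lo=1e-4, hi=1e4, iters=100):
    """Golden-section search over log(sigma2) for the variance-minimizing noise.

    The proxy above is convex in log(sigma2), so the search is well behaved.
    """
    phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = math.log(lo), math.log(hi)

    def f(u):
        return gradient_variance(math.exp(u), budget, base_var,
                                 base_cost, precision_cost)

    for _ in range(iters):
        c = b - phi * (b - a)
        d = a + phi * (b - a)
        if f(c) < f(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return math.exp((a + b) / 2.0)


def closed_form_noise_level(base_var, base_cost, precision_cost):
    """Stationary point of (base_var + s) * (base_cost + precision_cost / s)."""
    return math.sqrt(base_var * precision_cost / base_cost)


if __name__ == "__main__":
    numeric = best_noise_level(budget=1000.0, base_var=4.0,
                               base_cost=1.0, precision_cost=0.25)
    analytic = closed_form_noise_level(4.0, 1.0, 0.25)
    print("numeric :", numeric)
    print("analytic:", analytic)
```

Under this particular cost model the budget only rescales the objective, so the minimizing noise level does not depend on it; a different cost or variance model would generally change that conclusion.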

Problem

Research questions and friction points this paper is trying to address.

model-based reinforcement learning

imagined rollouts

reward noise

dynamics model error

policy optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

model-based reinforcement learning

imagined rollouts

Lipschitz constant