🤖 AI Summary
This work identifies a key limitation of existing latent dynamics models, such as the Recurrent State-Space Model (RSSM) used in Dreamer: their epistemic uncertainty estimates are unreliable, which undermines their use for guiding exploration or mitigating model exploitation. The study empirically demonstrates that latent state transitions tend to be drawn toward attractors in well-represented (high-density) regions of the latent space, so latent rollouts can deviate from the true environment dynamics. When these attractors coincide with high-reward regions, latent rollouts systematically overestimate future rewards. By analyzing how recurrent state-space modeling, uncertainty quantification, and these biases interact, the work helps explain why current uncertainty-guided exploration strategies can fail in latent models and informs the design of future world models.
📝 Abstract
Model-Based Reinforcement Learning distinguishes between physical dynamics models operating on proprioceptive inputs and latent dynamics models operating on high-dimensional image observations. A prominent latent approach is the Recurrent State-Space Model (RSSM) used in the Dreamer family. While epistemic uncertainty quantification to inform exploration and mitigate model exploitation is well established for physical dynamics models, its transfer to latent dynamics models has received limited scrutiny. We empirically demonstrate that latent transitions are biased toward well-represented regions of latent space, exhibiting an attractor behavior that can deviate from true environment dynamics. As a result, discrepancies in environment dynamics may not manifest in latent space, undermining the reliability of epistemic uncertainty estimates. Because these attractors often lie in high-reward regions, latent rollouts systematically overestimate predicted rewards. Our findings highlight key limitations of epistemic uncertainty estimation in latent dynamics models and motivate a more critical evaluation of this approach.
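The attractor effect can be illustrated with a deliberately simplified toy. This is not the paper's model or experimental setup. It assumes a 1-D latent space, an ensemble of hypothetical contractive transition functions that all pull toward one well-represented point, ensemble variance as the epistemic-uncertainty proxy (as in disagreement-based exploration), a reward that peaks at the attractor, and hypothetical true dynamics that drift away from it. All names and constants (`ATTRACTOR`, `make_member`, the gains and offsets) are illustrative assumptions.

```python
import math
import statistics

# Toy 1-D latent space. Every name and constant here is an illustrative
# assumption, not the paper's model or experimental setup.
ATTRACTOR = 0.0  # hypothetical well-represented latent region


def make_member(gain, offset):
    """Return one hypothetical ensemble member: z' = c + gain * (z - c) + offset.

    With |gain| < 1 the map is contractive, so repeated application pulls the
    latent state toward (roughly) the attractor c.
    """
    def step(z):
        return ATTRACTOR + gain * (z - ATTRACTOR) + offset
    return step


def disagreement(ensemble, z):
    """Epistemic-uncertainty proxy: population variance of member predictions."""
    return statistics.pvariance([member(z) for member in ensemble])


def mean_step(ensemble, z):
    """Latent rollout step using the ensemble-mean prediction."""
    return statistics.fmean(member(z) for member in ensemble)


def true_step(z):
    """Hypothetical true environment dynamics: drifts away from the attractor."""
    return z + 0.5


def reward(z):
    """Toy reward that is assumed to peak at the attractor."""
    return math.exp(-(z - ATTRACTOR) ** 2)


def rollout(step, z0, horizon):
    """Return the visited states [z1, ..., zH] and their summed reward."""
    states, z = [], z0
    for _ in range(horizon):
        z = step(z)
        states.append(z)
    return states, sum(reward(s) for s in states)


ensemble = [make_member(0.20, 0.0), make_member(0.25, 0.01), make_member(0.30, -0.01)]
z0 = 3.0  # start in an under-represented region, far from the attractor

latent_states, predicted_return = rollout(lambda z: mean_step(ensemble, z), z0, horizon=5)
_, true_return = rollout(true_step, z0, horizon=5)

# Disagreement at the start state vs. at the end of the latent rollout.
print("disagreement at z0:        ", disagreement(ensemble, z0))
print("disagreement after rollout:", disagreement(ensemble, latent_states[-1]))
print("predicted return:", predicted_return)
print("true return:     ", true_return)
```

In this toy, ensemble disagreement at the end of the latent rollout is two orders of magnitude smaller than at the start state, even though the hypothetical true dynamics never approach the attractor, and the latent rollout's return far exceeds the true return. Learned RSSM latents are high-dimensional and stochastic, so the sketch only illustrates the mechanism the abstract describes, not its magnitude in practice.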