🤖 AI Summary
Current visual world models are sensitive to the coverage of their training distribution, which leads to planning failures on out-of-distribution (OOD) states at inference time. To address this, we propose a VAE-based novelty detection mechanism, integrated for the first time into the DINO-WM world model architecture, that identifies OOD states in latent space in real time and constrains predictive-control trajectories to mitigate distributional shift. Our method uses a pretrained vision backbone for feature extraction and employs VAE reconstruction error to quantify state novelty, enabling distribution-aware decision-making within a closed-loop planner. Evaluated on multiple challenging simulated robotic tasks, the approach significantly improves data efficiency and planning stability, consistently outperforming state-of-the-art methods even when trained on extremely limited data.
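The core mechanism described above, scoring a latent state by how poorly a VAE trained on in-distribution features reconstructs it, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the network sizes, the `LatentVAE` class, and the 95th-percentile threshold are all assumptions, and the VAE is left untrained here purely to demonstrate the scoring API.

```python
import torch
import torch.nn as nn

class LatentVAE(nn.Module):
    """Tiny VAE over backbone features (sizes are illustrative, not from the paper)."""
    def __init__(self, dim=32, hidden=64, z=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z)
        self.logvar = nn.Linear(hidden, z)
        self.dec = nn.Sequential(nn.Linear(z, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        zs = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return self.dec(zs), mu, logvar

def novelty_score(vae, x):
    """Per-sample reconstruction error; high error suggests an OOD state."""
    with torch.no_grad():
        recon, _, _ = vae(x)
        return ((recon - x) ** 2).mean(dim=-1)

torch.manual_seed(0)
vae = LatentVAE()
train_feats = torch.randn(256, 32)          # stand-in for backbone features
# (training loop omitted; an untrained VAE still illustrates the scoring API)
thresh = novelty_score(vae, train_feats).quantile(0.95)
is_ood = novelty_score(vae, train_feats + 5.0) > thresh  # strongly shifted batch
print(is_ood.float().mean().item())
```

In a planner, `novelty_score` would be evaluated on the predicted latent states of each candidate rollout, and trajectories whose scores exceed the calibrated threshold would be penalized or discarded.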
📝 Abstract
Recent work on visual world models shows significant promise in learning latent state dynamics from the features of pre-trained image backbones. However, most current approaches are sensitive to training data quality, requiring near-complete coverage of the action and state space during training to prevent divergence during inference. To make a model-based planning algorithm more robust to the quality of the learned world model, we propose to use a variational autoencoder as a novelty detector, ensuring that action trajectories proposed during planning do not drive the learned model away from the training data distribution. To evaluate the effectiveness of this approach, we carried out a series of experiments in challenging simulated robot environments, incorporating the proposed method into a model-predictive control policy loop that extends the DINO-WM architecture. The results show that the proposed method improves on state-of-the-art solutions in terms of data efficiency.
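The model-predictive control loop described above can be sketched with a toy random-shooting planner that adds a novelty penalty to the rollout cost, so that low-cost trajectories stay inside the training distribution. Everything here is a simplified stand-in under stated assumptions: the 1D `dynamics`, the interval-based `novelty` score (a proxy for the VAE reconstruction-error test), and the weight `beta` are all hypothetical, not the paper's actual components.

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamics(s, a):
    """Toy surrogate for the learned latent dynamics model."""
    return s + a

def novelty(s, lo=-1.0, hi=1.0):
    """Stand-in novelty score: distance outside the 'training' region [lo, hi].
    In the actual method this role is played by VAE reconstruction error."""
    return max(0.0, lo - s, s - hi)

def plan(s0, goal, horizon=5, n_samples=256, beta=10.0):
    """Random-shooting MPC: cost = distance-to-goal + beta * novelty penalty."""
    best_cost, best_seq = np.inf, None
    for _ in range(n_samples):
        seq = rng.uniform(-0.5, 0.5, size=horizon)
        s, cost = s0, 0.0
        for a in seq:
            s = dynamics(s, a)
            cost += beta * novelty(s)   # penalize out-of-distribution rollouts
        cost += abs(s - goal)           # terminal goal-reaching cost
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq, best_cost

seq, cost = plan(s0=0.0, goal=0.9)
print(len(seq), round(cost, 3))
```

With the goal inside the in-distribution region, the planner finds a low-cost sequence; moving the goal outside `[lo, hi]` makes the novelty penalty dominate, illustrating how the constraint keeps closed-loop planning near the training data.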