Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training

📅 2026-05-08
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the sensitivity of existing world models to visual style variations, such as changes in color and lighting, when they are deployed as simulators for vision-language-action (VLA) policies. Such variations often cause hallucinations, blurriness, or overexposure in long-horizon predictions and undermine simulation reliability. To mitigate this, the authors propose a style-robust world model that disentangles visual texture from task-relevant dynamics through structure-guided style augmentation. They also introduce a dynamic latent bootstrapping mechanism that keeps training and inference consistent while keeping memory consumption low. The approach improves generation quality, long-term temporal coherence, and closed-loop robustness, and on the LIBERO benchmark it outperforms WoVR in generalization, simulation fidelity, and post-training VLA success rates.
📝 Abstract
The integration of Vision-Language-Action (VLA) models with World Models has gained increasing attention. One representative approach treats learned World Models as generative simulators, enabling policy optimization entirely within "imagination." However, when deployed as simulators for specific environments such as the LIBERO benchmark, existing World Models often suffer from poor generalization and long-horizon error accumulation. During closed-loop rollouts, these models are highly sensitive to initial-state perturbations; minor changes in color, illumination, and other visual factors can trigger cascading hallucinations, leading to severe blurriness or overexposure. Moreover, long-horizon error accumulation further degrades the quality and fidelity of predicted future states. These issues limit the reliability of World Models as simulators. To mitigate these problems, we propose Sword, a robust World Model framework. Our method introduces Structure-Guided Style Augmentation to disentangle the visual textures of interactive environments from task-relevant dynamics, thereby improving generalization. We further propose Dynamic Latent Bootstrapping, which maintains consistency between training and inference while keeping memory consumption low. Extensive experiments on the LIBERO benchmark show that our method significantly outperforms the baseline WoVR in terms of generalization, generation quality, robustness, fidelity, and the success rate of reinforcement-learning post-training for VLA models.
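To make the kind of perturbation described above concrete, the following is a minimal sketch of a style change that alters color and illumination while leaving scene structure and motion untouched. It is not the paper's Structure-Guided Style Augmentation, whose details the abstract does not give. It is a simple photometric stand-in, and the function names, the gain and bias ranges, and the nested-list frame format are all assumptions made for illustration. One style is sampled per clip and applied to every frame, so only appearance changes across the clip, not the dynamics.

```python
import random


def sample_style(rng, gain_range=(0.6, 1.4), bias_range=(-0.1, 0.1)):
    """Draw one per-channel gain and bias (hypothetical ranges)."""
    gains = [rng.uniform(*gain_range) for _ in range(3)]
    biases = [rng.uniform(*bias_range) for _ in range(3)]
    return gains, biases


def apply_style(frame, style):
    """Apply a per-channel affine color change to one frame.

    `frame` is an H x W x 3 nested list of floats in [0, 1].
    Gains are positive, so the ordering of intensities within each
    channel never flips; clipping can only merge values, not reorder them.
    """
    gains, biases = style
    return [
        [
            [min(1.0, max(0.0, gains[c] * pixel[c] + biases[c])) for c in range(3)]
            for pixel in row
        ]
        for row in frame
    ]


def augment_clip(frames, rng):
    """Sample a single style and apply it to every frame of the clip."""
    style = sample_style(rng)
    return [apply_style(frame, style) for frame in frames]


# Two tiny 2 x 2 RGB frames standing in for consecutive observations.
clip = [
    [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
     [[0.7, 0.8, 0.9], [0.2, 0.2, 0.2]]],
    [[[0.4, 0.5, 0.6], [0.1, 0.2, 0.3]],
     [[0.2, 0.2, 0.2], [0.7, 0.8, 0.9]]],
]
augmented = augment_clip(clip, random.Random(0))
```

Training a world model on pairs such as `clip` and `augmented`, which share the same actions and motion, is one generic way to discourage it from tying its dynamics predictions to a particular color or lighting condition. Whether Sword does this in the same way is not stated in the abstract.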
Problem

Research questions and friction points this paper is trying to address.

World Models
generalization
long-horizon error accumulation
visual perturbations
simulation fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structure-Guided Style Augmentation
Dynamic Latent Bootstrapping
World Models
Vision-Language-Action
Robust Simulation