🤖 AI Summary
Despite their strong visual dynamics modeling capabilities, World Foundation Models (WFMs) suffer from a gap between photorealistic generation fidelity and precise control accuracy, limiting their applicability in exact robotic manipulation. This paper proposes a lightweight adaptation framework that transforms WFMs into prediction-oriented, task-specific manipulation models, integrated with Model Predictive Control (MPC) for efficient policy guidance. It introduces two key innovations, temporal-spatial test-time training and memory persistence, which enable adaptive inference-time adjustment and long-horizon consistency without policy retraining. Evaluated on the LIBERO benchmark, the method achieves over a 41% improvement in task success rate while preserving computational efficiency, cross-task generalizability, and high-fidelity action precision, effectively bridging the longstanding gap between generative realism and actionable control accuracy in vision-based robotic learning.
📝 Abstract
World Foundation Models (WFMs) offer remarkable visual dynamics simulation capabilities, yet their application to precise robotic control remains limited by the gap between generative realism and control-oriented precision. While existing approaches use WFMs as synthetic data generators, they suffer from high computational costs and underutilization of pre-trained VLA policies. We introduce **AdaPower** (**Ada**pt and Em**power**), a lightweight adaptation framework that transforms general-purpose WFMs into specialist world models through two novel components: Temporal-Spatial Test-Time Training (TS-TTT) for inference-time adaptation and Memory Persistence (MP) for long-horizon consistency. Integrated within a Model Predictive Control framework, our adapted world model empowers pre-trained VLAs, achieving over 41% improvement in task success rates on LIBERO benchmarks without policy retraining, while preserving computational efficiency and generalist capabilities.
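To make the abstract's control loop concrete, here is a minimal, hedged sketch of how a learned world model can guide actions inside a Model Predictive Control (MPC) loop via random shooting. This is a generic illustration, not AdaPower's actual implementation: `world_model`, `cost_fn`, `mpc_plan`, and all dynamics here are hypothetical toy stand-ins for the paper's adapted WFM and task objective.

```python
import numpy as np

def mpc_plan(world_model, cost_fn, state, horizon=5, num_candidates=64,
             action_dim=2, rng=None):
    """Random-shooting MPC: score candidate action sequences under a
    (learned) world model and return the first action of the best one."""
    rng = rng or np.random.default_rng(0)
    # Sample candidate action sequences uniformly in [-1, 1].
    candidates = rng.uniform(-1.0, 1.0, size=(num_candidates, horizon, action_dim))
    costs = np.zeros(num_candidates)
    for i, seq in enumerate(candidates):
        s = state
        for a in seq:
            s = world_model(s, a)   # predict next state with the world model
            costs[i] += cost_fn(s)  # accumulate predicted task cost
    best = int(np.argmin(costs))
    # Receding horizon: execute only the first action, then replan.
    return candidates[best, 0]

# Toy example: linear dynamics, quadratic cost toward the origin.
world_model = lambda s, a: s + 0.1 * a           # hypothetical dynamics
cost_fn = lambda s: float(np.sum(s ** 2))        # hypothetical objective
action = mpc_plan(world_model, cost_fn, state=np.array([1.0, -1.0]))
```

In the paper's setting, the sampled candidates would instead come from the pre-trained VLA policy and the world model would be the TS-TTT-adapted WFM, but the select-and-replan structure of MPC is the same.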