🤖 AI Summary
This work addresses the challenge of building generative world models for humanoid robots, jointly modeling future visual observations (sampling) and discrete latent states (compression) to enable long-horizon reasoning and planning. We propose a dual-track prediction framework: the first track adapts the Wan-2.2 TI2V-5B video generation model, conditioning synthesis on embodied robot states via AdaLN-Zero and post-training with LoRA; the second track trains a spatiotemporal Transformer from scratch to directly predict discrete latent tokens. To our knowledge, this is the first adaptation of a large-scale video diffusion model to embodied future prediction. We also introduce a unified evaluation benchmark for this task. Our method achieves 23.0 dB PSNR on the sampling track and a top-500 cross-entropy of 6.6386 on the compression track, the best reported results on both.
📝 Abstract
World models are a powerful paradigm in AI and robotics, enabling agents to reason about the future by predicting visual observations or compact latent states. The 1X World Model Challenge introduces an open-source benchmark of real-world humanoid interaction, with two complementary tracks: sampling, focused on forecasting future image frames, and compression, focused on predicting future discrete latent codes. For the sampling track, we adapt the video generation foundation model Wan-2.2 TI2V-5B to video-state-conditioned future frame prediction. We condition the video generation on robot states using AdaLN-Zero, and further post-train the model using LoRA. For the compression track, we train a Spatio-Temporal Transformer model from scratch. Our models achieve 23.0 dB PSNR on the sampling task and a Top-500 CE of 6.6386 on the compression task, securing first place in both tracks.
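To make the AdaLN-Zero state conditioning concrete, below is a minimal PyTorch sketch of the general mechanism: an embedded robot state produces per-layer shift, scale, and gate vectors, with the gate zero-initialized so each conditioned block starts as an identity mapping. All module and tensor names here are hypothetical, and the single linear sublayer stands in for the attention/FFN sublayers of the actual Wan-2.2 TI2V-5B architecture; this is an illustration of the technique, not the paper's implementation.

```python
import torch
import torch.nn as nn


class AdaLNZeroBlock(nn.Module):
    """Minimal AdaLN-Zero conditioning block.

    A conditioning vector (e.g. an embedded robot state) modulates the
    feature stream via shift, scale, and a residual gate. The modulation
    projection is zero-initialized, so at the start of training the block
    reduces to the identity and conditioning is learned gradually.
    """

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        # LayerNorm without learnable affine params; scale/shift come
        # from the condition instead.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Stand-in for a real attention or feed-forward sublayer.
        self.sublayer = nn.Linear(dim, dim)
        # Projects the condition to (shift, scale, gate).
        self.ada = nn.Linear(cond_dim, 3 * dim)
        nn.init.zeros_(self.ada.weight)
        nn.init.zeros_(self.ada.bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); cond: (batch, cond_dim)
        shift, scale, gate = self.ada(cond).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x + gate.unsqueeze(1) * self.sublayer(h)


block = AdaLNZeroBlock(dim=64, cond_dim=16)
x = torch.randn(2, 10, 64)    # (batch, tokens, features)
state = torch.randn(2, 16)    # hypothetical embedded robot state
out = block(x, state)
```

Because the gate is zero at initialization, `out` equals `x` exactly before any training, which is the property that makes AdaLN-Zero a safe way to graft new conditioning onto a pretrained model.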