🤖 AI Summary
This work addresses the challenge of building generative world models for humanoid robots, jointly modeling future visual observations (sampling) and discrete latent states (compression) to enable long-horizon reasoning and planning. We propose a dual-track prediction framework: the first track adapts the Wan-2.2 TI2V-5B video generation model, conditioning synthesis on embodied robot states via AdaLN-Zero and post-training with LoRA; the second track trains a spatiotemporal Transformer from scratch to directly predict discrete latent tokens. To our knowledge, this is the first adaptation of a large-scale video diffusion model to embodied future prediction. We also introduce a unified evaluation benchmark for this task. Our method achieves 23.0 dB PSNR on the sampling track and a top-500 cross-entropy of 6.6386 on the compression track, the best reported results on both.
📝 Abstract
World models are a powerful paradigm in AI and robotics, enabling agents to reason about the future by predicting visual observations or compact latent states. The 1X World Model Challenge introduces an open-source benchmark of real-world humanoid interaction, with two complementary tracks: sampling, focused on forecasting future image frames, and compression, focused on predicting future discrete latent codes. For the sampling track, we adapt the video generation foundation model Wan-2.2 TI2V-5B to video-state-conditioned future frame prediction. We condition the video generation on robot states using AdaLN-Zero, and further post-train the model using LoRA. For the compression track, we train a Spatio-Temporal Transformer model from scratch. Our models achieve 23.0 dB PSNR on the sampling task and a Top-500 CE of 6.6386 on the compression task, securing first place in both tracks.
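To make the AdaLN-Zero state conditioning concrete, below is a minimal PyTorch sketch of the general mechanism: an embedded robot state produces per-layer shift, scale, and gate vectors, with the gate zero-initialized so each conditioned block starts as an identity mapping. All module and tensor names here are hypothetical, and the single linear sublayer stands in for the attention/FFN sublayers of the actual Wan-2.2 TI2V-5B architecture; this is an illustration of the technique, not the paper's implementation.

```python
import torch
import torch.nn as nn


class AdaLNZeroBlock(nn.Module):
    """Minimal AdaLN-Zero conditioning block.

    A conditioning vector (e.g. an embedded robot state) modulates the
    feature stream via shift, scale, and a residual gate. The modulation
    projection is zero-initialized, so at the start of training the block
    reduces to the identity and conditioning is learned gradually.
    """

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        # LayerNorm without learnable affine params; scale/shift come
        # from the condition instead.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Stand-in for a real attention or feed-forward sublayer.
        self.sublayer = nn.Linear(dim, dim)
        # Projects the condition to (shift, scale, gate).
        self.ada = nn.Linear(cond_dim, 3 * dim)
        nn.init.zeros_(self.ada.weight)
        nn.init.zeros_(self.ada.bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); cond: (batch, cond_dim)
        shift, scale, gate = self.ada(cond).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x + gate.unsqueeze(1) * self.sublayer(h)


block = AdaLNZeroBlock(dim=64, cond_dim=16)
x = torch.randn(2, 10, 64)    # (batch, tokens, features)
state = torch.randn(2, 16)    # hypothetical embedded robot state
out = block(x, state)
```

Because the gate is zero at initialization, `out` equals `x` exactly before any training, which is the property that makes AdaLN-Zero a safe way to graft new conditioning onto a pretrained model.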