🤖 AI Summary
To address the lack of lightweight, open-source, and high-fidelity video prediction models for humanoid robots operating in human-centered environments, this paper introduces the first open-source world model architecture specifically designed for humanoid robotics. Methodologically, it integrates Masked Transformers, Flow Matching-based generative modeling, multiple attention variants, and action conditioning, coupled with an efficient parameter-sharing strategy that reduces model size by 33–53% without compromising visual fidelity. Trained on 100 hours of real-world egocentric demonstration data collected from humanoid robots, the model supports both single-frame and multi-frame action-conditioned video prediction. It enables efficient training and deployment on just 1–2 GPUs, significantly enhancing action reasoning and long-horizon planning capabilities in open-world settings.
📝 Abstract
Humanoid robots have the potential to perform complex tasks in human-centered environments but require robust predictive models to reason about the outcomes of their actions. We introduce Humanoid World Models (HWM), a family of lightweight, open-source, video-based models that forecast future egocentric observations conditioned on actions. We train two types of generative models, Masked Transformers and Flow Matching, on 100 hours of humanoid demonstrations. Additionally, we explore architectural variants with different attention mechanisms and parameter-sharing strategies. Our parameter-sharing techniques reduce model size by 33–53% with minimal impact on performance or visual fidelity. HWM is designed to be trained and deployed in practical academic and small-lab settings, such as on 1–2 GPUs.
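The reported 33–53% size reduction follows directly from reusing transformer blocks across layers. A back-of-the-envelope sketch of this relationship (not the paper's actual architecture — the function names and the per-layer cost formula here are illustrative assumptions):

```python
# Hypothetical sketch: how sharing transformer blocks across layers
# shrinks the parameter count. The per-layer cost formula below
# (attention ~ 4*d^2, feed-forward ~ 2*ffn_mult*d^2) is a common
# rough estimate, not the paper's exact accounting.

def transformer_params(d_model: int, n_layers: int, ffn_mult: int = 4) -> int:
    """Approximate parameter count of a standard transformer stack."""
    per_layer = 4 * d_model**2 + 2 * ffn_mult * d_model**2
    return n_layers * per_layer

def shared_params(d_model: int, n_layers: int, n_unique: int, ffn_mult: int = 4) -> int:
    """Same depth, but only n_unique distinct blocks are stored and
    reused cyclically across all n_layers."""
    per_layer = 4 * d_model**2 + 2 * ffn_mult * d_model**2
    return n_unique * per_layer

full = transformer_params(512, 12)
half = shared_params(512, 12, 6)        # each block reused twice
print(f"reduction: {1 - half / full:.0%}")   # prints "reduction: 50%"
```

Under this sketch, sharing roughly a third to a half of the blocks yields reductions in the 33–53% range the paper reports, since only the unique blocks contribute parameters.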