🤖 AI Summary
Robotics world models face the dual challenges of slow inference and insufficient physical plausibility in generated trajectories. This paper proposes KeyWorld: (1) a motion-aware key frame selection mechanism that extracts semantically meaningful key frames via iterative trajectory simplification; (2) a DiT-based text-to-keyframe generation module for efficient, high-fidelity key frame synthesis; and (3) a lightweight convolutional interpolation model that reconstructs the full video sequence from the sparse key frames. By concentrating expensive transformer computation on physically meaningful key transitions and delegating intermediate frames to a cheap interpolator, KeyWorld improves efficiency without sacrificing physical consistency. On the LIBERO benchmark, KeyWorld achieves a 5.68× inference speedup over the frame-to-frame generation baseline and substantially enhances the physical validity of generated videos, particularly on complex manipulation tasks.
📝 Abstract
Robotic world models are a promising paradigm for forecasting future environment states, yet their inference speed and the physical plausibility of generated trajectories remain critical bottlenecks, limiting their real-world applications. This stems from the redundancy of the prevailing frame-to-frame generation approach, in which the model performs costly computation on highly similar frames while neglecting the semantic importance of key transitions. To address this inefficiency, we propose KeyWorld, a framework that improves text-conditioned robotic world models by concentrating transformer computation on a few semantic key frames while employing a lightweight convolutional model to fill in the intermediate frames. Specifically, KeyWorld first identifies significant transitions by iteratively simplifying the robot's motion trajectories, obtaining the ground-truth key frames. Then, a DiT model is trained to reason about and generate these physically meaningful key frames from textual task descriptions. Finally, a lightweight interpolator efficiently reconstructs the full video by inpainting all intermediate frames. Evaluations on the LIBERO benchmark demonstrate that KeyWorld achieves a 5.68× acceleration over the frame-to-frame generation baseline, and focusing on motion-aware key frames further improves the physical validity of the generated videos, especially on complex tasks. Our approach highlights a practical path toward deploying world models in real-time robotic control and other domains that require world models to be both efficient and effective. Code is released at https://anonymous.4open.science/r/Keyworld-E43D.
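The abstract describes selecting key frames by iteratively simplifying the robot's motion trajectory. The paper does not give the exact algorithm, but a minimal sketch of this idea is a Ramer-Douglas-Peucker-style simplification over end-effector positions: frames whose removal would distort the trajectory by more than a tolerance are kept as key frames. The function name, the `epsilon` threshold, and the use of 2D end-effector positions are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of motion-aware key frame selection via trajectory
# simplification (Ramer-Douglas-Peucker style). `epsilon` and the 2D
# end-effector representation are assumptions for illustration.
import numpy as np

def simplify_trajectory(points: np.ndarray, epsilon: float) -> list[int]:
    """Return indices of key frames: points whose distance from the chord
    between the current segment endpoints exceeds `epsilon` are retained."""
    def rdp(lo: int, hi: int) -> list[int]:
        if hi <= lo + 1:
            return [lo, hi]
        start, seg = points[lo], points[hi] - points[lo]
        seg_len = np.linalg.norm(seg)
        diffs = points[lo + 1:hi] - start
        if seg_len == 0:
            dists = np.linalg.norm(diffs, axis=1)
        else:
            # Perpendicular distance of each interior point to chord lo->hi.
            proj = np.outer(diffs @ seg / seg_len**2, seg)
            dists = np.linalg.norm(diffs - proj, axis=1)
        k = int(np.argmax(dists)) + lo + 1
        if dists[k - lo - 1] > epsilon:
            # Split at the most significant transition and recurse.
            return rdp(lo, k)[:-1] + rdp(k, hi)
        return [lo, hi]
    return rdp(0, len(points) - 1)

# Toy path: straight motion, then a sharp turn at index 3.
traj = np.array([[0, 0], [1, 0], [2, 0], [3, 0], [3, 1], [3, 2]], dtype=float)
keys = simplify_trajectory(traj, epsilon=0.1)
print(keys)  # -> [0, 3, 5]: endpoints plus the turning point survive
```

In this toy example only the endpoints and the sharp turn are retained, which matches the intuition in the abstract: redundant near-duplicate frames are dropped, and the DiT generator only needs to synthesize the frames where the motion semantically changes.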