LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation

📅 2025-05-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Predicting future visual states for robot-object interaction in embodied intelligence remains challenging, particularly in achieving pixel-level fidelity. Method: This paper introduces LaDi-WM, a latent diffusion-based world model that (1) applies diffusion modeling to latent-space dynamics prediction within a world model; (2) constructs a joint latent representation aligned with both geometric (DINO) and semantic (CLIP) features; and (3) designs an iterative diffusion strategy that jointly optimizes latent-state evolution and action generation. Results: The approach improves policy performance by 27.9% on the LIBERO-LONG benchmark, increases task success rates in real-world scenarios by 20%, and markedly improves cross-task generalization and deployment robustness.

📝 Abstract
Predictive manipulation has recently gained considerable attention in the Embodied AI community due to its potential to improve robot policy performance by leveraging predicted states. However, generating accurate future visual states of robot-object interactions from world models remains a well-known challenge, particularly in achieving high-quality pixel-level representations. To this end, we propose LaDi-WM, a world model that predicts the latent space of future states using diffusion modeling. Specifically, LaDi-WM leverages the well-established latent space aligned with pre-trained Visual Foundation Models (VFMs), which comprises both geometric features (DINO-based) and semantic features (CLIP-based). We find that predicting the evolution of the latent space is easier to learn and more generalizable than directly predicting pixel-level images. Building on LaDi-WM, we design a diffusion policy that iteratively refines output actions by incorporating forecasted states, thereby generating more consistent and accurate results. Extensive experiments on both synthetic and real-world benchmarks demonstrate that LaDi-WM significantly enhances policy performance, by 27.9% on the LIBERO-LONG benchmark and by 20% in real-world scenarios. Furthermore, our world model and policies achieve impressive generalizability in real-world experiments.
Problem

Research questions and friction points this paper is trying to address.

Predicting future visual states in robot-object interactions
Achieving high-quality pixel-level representations in world models
Enhancing robot policy performance via latent space diffusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses latent diffusion for future state prediction
Leverages pre-trained Visual Foundation Models
Iteratively refines actions with forecasted states
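The predict-then-refine loop described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the encoder, dynamics step, and policy update (`encode_obs`, `world_model_predict`, `diffusion_policy`) are hypothetical stand-ins for the VFM-aligned latent encoder, the latent diffusion world model, and the diffusion policy, respectively.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_obs():
    """Stand-in for the VFM encoders: concatenate hypothetical
    geometric (DINO-like) and semantic (CLIP-like) feature vectors."""
    geo = rng.standard_normal(8)
    sem = rng.standard_normal(8)
    return np.concatenate([geo, sem])

def world_model_predict(latent, action, noise_scale):
    """Toy latent-dynamics step: a denoising-style update that drifts
    the latent as a function of (latent, action) plus shrinking noise."""
    drift = 0.9 * latent + 0.1 * np.tile(action, len(latent) // len(action))
    return drift + noise_scale * rng.standard_normal(latent.shape)

def diffusion_policy(latent, action, step, total_steps):
    """Toy action refinement conditioned on the forecasted latent;
    injected noise anneals to zero as diffusion steps progress."""
    noise = (1.0 - step / total_steps) * rng.standard_normal(action.shape)
    return 0.8 * action + 0.2 * np.tanh(latent[: len(action)]) + 0.1 * noise

# Iterative refinement: forecast the future latent state, refine the
# action against that forecast, and repeat for a fixed number of steps.
obs_latent = encode_obs()
action = np.zeros(4)          # e.g. a 4-DoF end-effector command
T = 5                         # number of refinement iterations
for t in range(T):
    future_latent = world_model_predict(obs_latent, action,
                                        noise_scale=1.0 - t / T)
    action = diffusion_policy(future_latent, action,
                              step=t, total_steps=T)

print(action.shape)  # (4,)
```

The key structural point this sketch captures is that the policy's action is not produced in one shot: each iteration re-queries the world model with the current action candidate, so the forecasted latent state and the action co-evolve.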
Yuhang Huang
National University of Defense Technology
Deep Learning · Computer Vision
Jiazhao Zhang
Peking University
Shilong Zou
National University of Defense Technology
Xinwang Liu
National University of Defense Technology
Ruizhen Hu
Shenzhen University
Kai Xu
National University of Defense Technology