MinD: Unified Visual Imagination and Control via Hierarchical World Models

📅 2025-06-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video generation models (VGMs) face two key bottlenecks in unified world modeling for robotics: high generation latency and misalignment between visual imagination and executable actions. To address these, we propose MinD, a hierarchical diffusion-based framework for joint visual-action modeling. MinD introduces a dual-system architecture and the DiffMatcher module, which couples low-frequency video prediction with a high-frequency action policy through diffusion-based alignment and independent scheduling. It further incorporates latent-space diffusion matching and task-prior risk assessment to enable real-time closed-loop control and feasibility pre-evaluation. Evaluated on RL-Bench, MinD reaches a state-of-the-art success rate above 63%, demonstrating for the first time within a unified framework concurrent high-fidelity video prediction, precise action generation, and task feasibility reasoning.

📝 Abstract
Video generation models (VGMs) offer a promising pathway for unified world modeling in robotics by integrating simulation, prediction, and manipulation. However, their practical application remains limited due to (1) slow generation speed, which limits real-time interaction, and (2) poor consistency between imagined videos and executable actions. To address these challenges, we propose Manipulate in Dream (MinD), a hierarchical diffusion-based world model framework that employs a dual-system design for vision-language manipulation. MinD executes the VGM at low frequency to extract video prediction features, while leveraging a high-frequency diffusion policy for real-time interaction. This architecture enables low-latency, closed-loop control in manipulation with coherent visual guidance. To better coordinate the two systems, we introduce a video-action diffusion matching module (DiffMatcher) with a novel co-training strategy that uses a separate scheduler for each diffusion model. Specifically, we introduce a diffusion-forcing mechanism to DiffMatcher that aligns the two models' intermediate representations during training, helping the fast action model better understand video-based predictions. Beyond manipulation, MinD also functions as a world simulator, reliably predicting task success or failure in latent space before execution. Trustworthiness analysis further shows that VGMs can preemptively evaluate task feasibility and mitigate risks. Extensive experiments across multiple benchmarks demonstrate that MinD achieves state-of-the-art manipulation performance (63%+ success) on RL-Bench, advancing the frontier of unified world modeling in robotics.
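
The abstract describes a dual-frequency split: a slow video generation model refreshes imagined-future features only occasionally, while a fast diffusion policy runs at control rate against the most recent features. The Python sketch below illustrates that scheduling idea only; the class and method names (`DualSystemController`, `predict_features`, `sample_action`) and the refresh interval are illustrative assumptions, not the paper's actual interfaces.

```python
class DualSystemController:
    """Hypothetical dual-frequency control loop in the spirit of MinD's design."""

    def __init__(self, vgm, policy, vgm_refresh_every=10):
        self.vgm = vgm                        # slow video generation model
        self.policy = policy                  # fast action diffusion policy
        self.vgm_refresh_every = vgm_refresh_every
        self.video_features = None            # cached latent video-prediction features

    def step(self, t, observation, instruction):
        # Refresh imagined-future features only every N control steps, so the
        # slow VGM never blocks real-time action generation.
        if self.video_features is None or t % self.vgm_refresh_every == 0:
            self.video_features = self.vgm.predict_features(observation, instruction)
        # The fast policy produces an action conditioned on the current
        # observation plus the cached video-prediction features.
        return self.policy.sample_action(observation, self.video_features)
```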
Problem

Research questions and friction points this paper is trying to address.

Slow video generation speed limits real-time interaction
Poor consistency between imagined videos and executable actions
Need for unified world modeling in robotics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical diffusion-based world model framework
Dual-system design for vision-language manipulation
Diffusion-forcing mechanism aligns intermediate representations (see the training-step sketch after this list)
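
The bullets above mention co-training with separate schedulers and a diffusion-forcing style alignment of intermediate representations. Below is a minimal sketch of one such training step, assuming a diffusers-style scheduler interface (`add_noise`, `config.num_train_timesteps`); the model signatures, the projection head `proj`, and the loss weighting are hypothetical and only illustrate the general idea of matching the action branch's features to the video branch's.

```python
import torch
import torch.nn.functional as F

def cotrain_step(video_model, action_model, proj,
                 video_latents, actions, cond,
                 video_scheduler, action_scheduler, align_weight=0.1):
    """One hypothetical co-training step with an independent scheduler per branch."""
    b = video_latents.size(0)

    # Separate timesteps and noise for the video and action diffusion processes.
    t_v = torch.randint(0, video_scheduler.config.num_train_timesteps, (b,),
                        device=video_latents.device)
    t_a = torch.randint(0, action_scheduler.config.num_train_timesteps, (b,),
                        device=actions.device)
    eps_v, eps_a = torch.randn_like(video_latents), torch.randn_like(actions)
    noisy_video = video_scheduler.add_noise(video_latents, eps_v, t_v)
    noisy_actions = action_scheduler.add_noise(actions, eps_a, t_a)

    # Each branch predicts its own noise and exposes an intermediate feature.
    pred_v, feat_v = video_model(noisy_video, t_v, cond)
    pred_a, feat_a = action_model(noisy_actions, t_a, cond)

    # Standard denoising losses, plus an alignment term that pulls the action
    # branch's features toward a projection of the (detached) video features.
    loss = (F.mse_loss(pred_v, eps_v)
            + F.mse_loss(pred_a, eps_a)
            + align_weight * F.mse_loss(feat_a, proj(feat_v.detach())))
    return loss
```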
Authors

Xiaowei Chi (The Hong Kong University of Science and Technology) · Multimodal Generation, Robotics, Computer Vision
Kuangzhi Ge (Peking University) · Multimodal Learning, Embodied AI
Jiaming Liu (Peking University)
Siyuan Zhou (Tencent Robotics X; Hong Kong University of Science and Technology)
Peidong Jia (Peking University)
Zichen He (Peking University)
Yuzhen Liu (Tencent Robotics X)
Tingguang Li (Tencent Robotics X) · Reinforcement Learning, Robotics
Lei Han (Tencent Robotics X)
Sirui Han (The Hong Kong University of Science and Technology) · Large Language Model, Interdisciplinary Artificial Intelligence
Shanghang Zhang (Peking University) · Embodied AI, Foundation Models
Yike Guo (Hong Kong University of Science and Technology)