Consistent World Models via Foresight Diffusion

📅 2025-05-22
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Diffusion models for world modeling often generate inconsistent trajectories because condition understanding and target denoising are entangled within a shared architecture. To address this, we propose Foresight Diffusion, a dual-stream decoupled framework: a deterministic predictive stream models the temporal conditions, while the denoising stream specializes in target generation, guided by representations distilled from a pretrained predictor. This explicit separation of condition understanding from denoising removes an inherent consistency bottleneck in diffusion-based trajectory modeling. Evaluated on robot video prediction and scientific spatiotemporal forecasting tasks, the method achieves significant improvements in both prediction accuracy and sample-trajectory consistency, outperforming state-of-the-art diffusion models and streaming world model baselines.

πŸ“ Abstract
Diffusion and flow-based models have enabled significant progress in generation tasks across various modalities and have recently found applications in world modeling. However, unlike typical generation tasks that encourage sample diversity, world models entail different sources of uncertainty and require consistent samples aligned with the ground-truth trajectory, which is a limitation we empirically observe in diffusion models. We argue that a key bottleneck in learning consistent diffusion-based world models lies in the suboptimal predictive ability, which we attribute to the entanglement of condition understanding and target denoising within shared architectures and co-training schemes. To address this, we propose Foresight Diffusion (ForeDiff), a diffusion-based world modeling framework that enhances consistency by decoupling condition understanding from target denoising. ForeDiff incorporates a separate deterministic predictive stream to process conditioning inputs independently of the denoising stream, and further leverages a pretrained predictor to extract informative representations that guide generation. Extensive experiments on robot video prediction and scientific spatiotemporal forecasting show that ForeDiff improves both predictive accuracy and sample consistency over strong baselines, offering a promising direction for diffusion-based world models.
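The abstract distinguishes two criteria for world models: predictive accuracy (closeness to the ground-truth trajectory) and sample consistency (low spread across stochastic samples). A minimal sketch of how these two quantities could be measured on toy trajectories is shown below; the toy data, shapes, and metric choices are illustrative assumptions, not the paper's evaluation protocol.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: several stochastic samples of a predicted trajectory,
# compared against a single ground-truth trajectory.
n_samples, horizon, dim = 5, 10, 3
gt = rng.normal(size=(horizon, dim))
samples = gt + 0.1 * rng.normal(size=(n_samples, horizon, dim))

# Predictive accuracy: mean squared error of each sample to the ground truth.
mse_to_gt = ((samples - gt) ** 2).mean(axis=(1, 2))

# Sample consistency: spread of the samples around their own mean;
# a consistent world model keeps this spread small.
spread = samples.std(axis=0).mean()

print(mse_to_gt.mean(), spread)
```

Under this toy view, a model can score well on average accuracy yet still be inconsistent if its samples scatter widely around the ground truth, which is the failure mode the paper attributes to entangled diffusion architectures.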
Problem

Research questions and friction points this paper is trying to address.

Enhancing consistency in diffusion-based world models
Decoupling condition understanding from target denoising
Improving predictive accuracy in spatiotemporal forecasting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples condition understanding from target denoising
Uses deterministic predictive stream for conditioning
Leverages pretrained predictor to guide generation
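The decoupling listed above can be sketched as two separate networks: a deterministic predictive stream that digests only the conditioning inputs into a guidance representation, and a denoising stream that sees the noisy target plus that distilled representation, never the raw conditions. The tiny NumPy forward pass below is a hedged illustration of this structure; all layer sizes, names, and the MLP parameterization are assumptions, not the ForeDiff implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, b1, w2, b2):
    # Two-layer perceptron with ReLU, reused for both streams.
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

d_obs, d_hid, d_rep, d_tgt = 8, 16, 4, 8

# Deterministic predictive stream: conditions -> guidance representation.
pw1, pb1 = rng.normal(size=(d_obs, d_hid)), np.zeros(d_hid)
pw2, pb2 = rng.normal(size=(d_hid, d_rep)), np.zeros(d_rep)

# Denoising stream: (noisy target, guidance, timestep) -> noise estimate.
dw1, db1 = rng.normal(size=(d_tgt + d_rep + 1, d_hid)), np.zeros(d_hid)
dw2, db2 = rng.normal(size=(d_hid, d_tgt)), np.zeros(d_tgt)

def predict_guidance(cond):
    # A pretrained (here: frozen, randomly initialized) predictor
    # processes the conditioning input independently of denoising.
    return mlp(cond, pw1, pb1, pw2, pb2)

def denoise_step(x_noisy, cond, t):
    # The denoiser never sees raw conditions, only the distilled
    # representation, so condition understanding stays decoupled.
    g = predict_guidance(cond)
    inp = np.concatenate([x_noisy, g, np.array([t])])
    return mlp(inp, dw1, db1, dw2, db2)

cond = rng.normal(size=d_obs)
x_t = rng.normal(size=d_tgt)
eps_hat = denoise_step(x_t, cond, t=0.5)
print(eps_hat.shape)
```

The design point is that gradients and capacity for understanding the conditions live entirely in the predictive stream, so the denoising stream can specialize, which is the separation the paper credits for improved consistency.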
Yu Zhang
School of Software, BNRist, Tsinghua University, China
Xingzhuo Guo
Ph.D. Student, Tsinghua University
Transfer Learning · Diffusion Models · AI4Science
Haoran Xu
School of Software, BNRist, Tsinghua University, China