WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video diffusion models suffer from geometric inconsistency and limited controllability in 3D/4D generation, while fine-tuning or retraining compromises pretrained knowledge and incurs high computational cost. This paper introduces WorldForge, a training-free inference-time framework built from three core components: Intra-Step Recursive Refinement, Flow-Gated Latent Fusion, and Dual-Path Self-Corrective Guidance. By combining latent-space optical-flow analysis, motion-appearance-disentangled injection, and dual-path contrastive self-correction, the method injects trajectory priors precisely and dynamically during inference. Crucially, it preserves the integrity of pretrained knowledge while significantly improving trajectory consistency, visual fidelity, and photorealism. Extensive evaluations across multiple benchmarks demonstrate superior performance and plug-and-play applicability.
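As a toy illustration of the dual-path self-correction idea in the summary above: the guided and unguided denoising predictions are compared, and their disagreement is applied as a bounded correction. All names, the scale, and the clipping rule here are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def self_corrective_guidance(pred_guided, pred_free, scale=1.0, clip=0.5):
    """Toy sketch of dual-path self-corrective guidance: compare the
    guided and unguided denoising paths and apply a bounded correction,
    so noisy or misaligned structural signals cannot drag the trajectory
    off course. The clipping rule is an assumption, not the paper's."""
    drift = pred_guided - pred_free            # disagreement between the two paths
    correction = np.clip(scale * drift, -clip, clip)
    return pred_free + correction

# minimal usage on stand-in predictions
pred_free = np.zeros(3)
pred_guided = np.array([0.2, 1.0, -1.0])
corrected = self_corrective_guidance(pred_guided, pred_free)
```

The clipping bound is the key design choice in this sketch: small disagreements pass through unchanged, while large ones (likely noise) are capped.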

📝 Abstract
Recent video diffusion models demonstrate strong potential in spatial intelligence tasks due to their rich latent world priors. However, this potential is hindered by their limited controllability and geometric inconsistency, creating a gap between their strong priors and their practical use in 3D/4D tasks. As a result, current approaches often rely on retraining or fine-tuning, which risks degrading pretrained knowledge and incurs high computational costs. To address this, we propose WorldForge, a training-free, inference-time framework composed of three tightly coupled modules. Intra-Step Recursive Refinement introduces a recursive refinement mechanism during inference, which repeatedly optimizes network predictions within each denoising step to enable precise trajectory injection. Flow-Gated Latent Fusion leverages optical flow similarity to decouple motion from appearance in the latent space and selectively inject trajectory guidance into motion-related channels. Dual-Path Self-Corrective Guidance compares guided and unguided denoising paths to adaptively correct trajectory drift caused by noisy or misaligned structural signals. Together, these components inject fine-grained, trajectory-aligned guidance without training, achieving both accurate motion control and photorealistic content generation. Extensive experiments across diverse benchmarks validate our method's superiority in realism, trajectory consistency, and visual fidelity. This work introduces a novel plug-and-play paradigm for controllable video synthesis, offering a new perspective on leveraging generative priors for spatial intelligence.
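The Intra-Step Recursive Refinement described in the abstract can be sketched schematically: within a single denoising step, the network prediction is repeatedly nudged toward a trajectory prior before the step is committed. The toy sketch below uses a linear stand-in model; the loss, step size, and function names are all hypothetical, not the paper's API.

```python
import numpy as np

def denoise_step_with_refinement(model, x_t, t, target_traj, n_iters=3, lr=0.1):
    """Toy sketch of intra-step recursive refinement: repeatedly
    optimize the network prediction within one denoising step so it
    moves toward a trajectory prior. `model` and `target_traj` are
    stand-ins for the actual denoiser and trajectory guidance."""
    pred = model(x_t, t)                 # initial prediction for this step
    for _ in range(n_iters):
        residual = target_traj - pred    # gap to the trajectory prior
        pred = pred + lr * residual      # recursive refinement update
    return pred

# minimal usage with a linear "model" stand-in
model = lambda x, t: 0.5 * x
x_t = np.ones(4)
target = np.full(4, 2.0)
refined = denoise_step_with_refinement(model, x_t, 0, target)
```

Because the refinement happens inside each step rather than across steps, the sketch illustrates how guidance can be injected without touching the model's weights.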
Problem

Research questions and friction points this paper is trying to address.

Video diffusion models lack controllability and geometric consistency
Existing methods require retraining, risking knowledge loss and high costs
Need training-free framework for precise motion control and realistic generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free recursive refinement for precise guidance
Optical flow-gated latent fusion decouples motion from appearance
Dual-path self-corrective guidance prevents trajectory drift
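The flow-gated fusion bullet above can be sketched as a per-channel gate: channels whose latent features correlate with optical flow receive trajectory guidance, while appearance channels are left untouched. The similarity scores, threshold, and hard gating below are illustrative assumptions, not the paper's mechanism.

```python
import numpy as np

def flow_gated_fusion(latent, guidance, flow_sim, thresh=0.5):
    """Toy sketch of flow-gated latent fusion: a per-channel optical-flow
    similarity score gates which latent channels receive trajectory
    guidance. `flow_sim` holds one score per channel (assumed given)."""
    gate = (flow_sim > thresh).astype(latent.dtype)  # 1 for motion channels
    gate = gate[:, None, None]                       # broadcast over H, W
    # inject guidance only into motion channels; keep appearance intact
    return gate * guidance + (1.0 - gate) * latent

# minimal usage: 4 channels, 2x2 spatial latent
latent = np.zeros((4, 2, 2))       # stand-in appearance content
guidance = np.ones((4, 2, 2))      # stand-in trajectory guidance
flow_sim = np.array([0.9, 0.1, 0.8, 0.2])
fused = flow_gated_fusion(latent, guidance, flow_sim)
```

A soft (continuous) gate would be a natural variant; the hard threshold here just makes the motion/appearance split explicit.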
Chenxi Song
Westlake University & Jilin University
3D Vision; 3D & 4D Generation & Reconstruction

Yanming Yang
Westlake University
3D Vision

Tong Zhao
AGI Lab, School of Engineering, Westlake University, Hangzhou, China

Ruibo Li
Nanyang Technological University

Chi Zhang
AGI Lab, School of Engineering, Westlake University, Hangzhou, China