Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

📅 2025-07-10
🤖 AI Summary
Video diffusion models trained solely on raw videos struggle to learn geometry-aware 3D dynamic structure, resulting in spatially inconsistent generations. To address this, the paper proposes Geometry Forcing, a framework that explicitly injects geometric structure into the diffusion process. Specifically, it aligns intermediate latent representations of the diffusion model with features from a pretrained geometric foundation model through two complementary objectives: angular alignment, a cosine-similarity-based directional match, and scale alignment, a regression onto unnormalized geometric features that preserves scale information. Together these objectives fuse geometric and video representations. Geometry Forcing substantially improves generation quality and 3D consistency under varying camera viewpoints and motion conditions, and extensive experiments show strong performance across multiple benchmarks in both visual fidelity and structural coherence.

📝 Abstract
Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometric-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from normalized diffusion representation. We evaluate Geometry Forcing on both camera view-conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: https://GeometryForcing.github.io.
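The two alignment objectives described above can be sketched in code. This is an illustrative NumPy sketch under stated assumptions, not the authors' implementation: the function names, tensor shapes, and the use of plain MSE for the scale regression are assumptions for clarity.

```python
import numpy as np

def angular_alignment_loss(pred, geo_feats):
    """Angular Alignment (sketch): penalize directional mismatch between
    predicted features and geometric foundation-model features via cosine
    similarity along the channel (last) dimension. Loss is 1 - mean cosine,
    so it is 0 when directions agree everywhere."""
    p = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    g = geo_feats / np.linalg.norm(geo_feats, axis=-1, keepdims=True)
    return 1.0 - np.mean(np.sum(p * g, axis=-1))

def scale_alignment_loss(pred_unnormalized, geo_feats):
    """Scale Alignment (sketch): regress the *unnormalized* geometric
    features so that scale information, discarded by cosine similarity,
    is preserved. A plain mean-squared error is used here as a stand-in."""
    return np.mean((pred_unnormalized - geo_feats) ** 2)
```

Note that the angular term is scale-invariant (doubling the predicted features leaves it unchanged), which is exactly why a separate scale term regressing unnormalized features is needed.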
Problem

Research questions and friction points this paper is trying to address.

Video diffusion models lack 3D geometric-aware representations
Aligning diffusion models with 3D features improves world consistency
Enhancing video generation with geometry-aware directional and scale alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometry Forcing aligns video diffusion with 3D representations
Uses Angular and Scale Alignment for geometric consistency
Improves 3D-aware video generation quality significantly
Authors
Haoyu Wu, Microsoft Research
Diankun Wu, Tsinghua University
Tianyu He, Microsoft Research
Junliang Guo, Microsoft Research
Yang Ye, Microsoft Research
Yueqi Duan, Tsinghua University
Jiang Bian, Microsoft Research