Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

📅 2025-07-10
🤖 AI Summary
Video diffusion models trained solely on raw videos struggle to learn geometry-aware 3D dynamic structure, resulting in spatially inconsistent generations. To address this, the paper proposes Geometry Forcing, a framework that explicitly injects geometric structure into the diffusion process. Specifically, it aligns intermediate latent representations of the diffusion model with features from a pretrained geometric foundation model through two complementary objectives: angular alignment, a cosine-similarity-based directional match, and scale alignment, a regression onto unnormalized geometric features that preserves scale information. Together these objectives fuse geometric and video representations. Geometry Forcing substantially improves generation quality and 3D consistency under varying camera viewpoints and motion conditions, and extensive experiments show strong performance across multiple benchmarks in both visual fidelity and structural coherence.

📝 Abstract
Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometric-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from normalized diffusion representation. We evaluate Geometry Forcing on both camera view-conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: https://GeometryForcing.github.io.
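The two alignment objectives described above can be sketched in code. This is an illustrative NumPy sketch under stated assumptions, not the authors' implementation: the function names, tensor shapes, and the use of plain MSE for the scale regression are assumptions for clarity.

```python
import numpy as np

def angular_alignment_loss(pred, geo_feats):
    """Angular Alignment (sketch): penalize directional mismatch between
    predicted features and geometric foundation-model features via cosine
    similarity along the channel (last) dimension. Loss is 1 - mean cosine,
    so it is 0 when directions agree everywhere."""
    p = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    g = geo_feats / np.linalg.norm(geo_feats, axis=-1, keepdims=True)
    return 1.0 - np.mean(np.sum(p * g, axis=-1))

def scale_alignment_loss(pred_unnormalized, geo_feats):
    """Scale Alignment (sketch): regress the *unnormalized* geometric
    features so that scale information, discarded by cosine similarity,
    is preserved. A plain mean-squared error is used here as a stand-in."""
    return np.mean((pred_unnormalized - geo_feats) ** 2)
```

Note that the angular term is scale-invariant (doubling the predicted features leaves it unchanged), which is exactly why a separate scale term regressing unnormalized features is needed.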
Problem

Research questions and friction points this paper is trying to address.

Video diffusion models lack 3D geometric-aware representations
Aligning diffusion models with 3D features improves world consistency
Enhancing video generation with geometry-aware directional and scale alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometry Forcing aligns video diffusion with 3D representations
Uses Angular and Scale Alignment for geometric consistency
Improves 3D-aware video generation quality significantly
Authors
Haoyu Wu, Microsoft Research
Diankun Wu, Tsinghua University
Tianyu He, Microsoft Research
Junliang Guo, Microsoft Research
Yang Ye, Microsoft Research
Yueqi Duan, Tsinghua University
Jiang Bian, Microsoft Research