DiReCT: Disentangled Regularization of Contrastive Trajectories for Physics-Refined Video Generation

πŸ“… 2026-03-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses a critical limitation in current video generation models: reconstruction objectives fail to distinguish physically plausible from implausible dynamics, and semantic-physical entanglement in text-conditioned contrastive learning induces gradient conflicts, yielding outputs that violate fundamental physical laws. To resolve this, the authors propose DiReCT, a framework that formally characterizes the gradient conflict arising from semantic-physical coupling and introduces dual-scale decoupled contrastive learning during post-training. The approach constructs semantically decoupled negative samples at both macroscopic and microscopic scales, leverages large language model–guided perturbations along physical axes to generate hard negatives, and incorporates velocity-space distribution regularization. Evaluated on the Wan 2.1-1.3B model, DiReCT improves VideoPhy physical commonsense scores by 16.7% over the baseline and 11.3% over supervised fine-tuning, achieving substantially better physical consistency without compromising visual quality or increasing training cost.
πŸ“ Abstract
Flow-matching video generators produce temporally coherent, high-fidelity outputs yet routinely violate elementary physics because their reconstruction objectives penalize per-frame deviations without distinguishing physically consistent dynamics from impossible ones. Contrastive flow matching offers a principled remedy by pushing apart velocity-field trajectories of differing conditions, but we identify a fundamental obstacle in the text-conditioned video setting: semantic-physics entanglement. Because natural-language prompts couple scene content with physical behavior, naive negative sampling draws conditions whose velocity fields largely overlap with the positive sample's, causing the contrastive gradient to directly oppose the flow-matching objective. We formalize this gradient conflict, deriving a precise alignment condition that reveals when contrastive learning helps versus harms training. Guided by this analysis, we introduce DiReCT (Disentangled Regularization of Contrastive Trajectories), a lightweight post-training framework that decomposes the contrastive signal into two complementary scales: a macro-contrastive term that draws partition-exclusive negatives from semantically distant regions for interference-free global trajectory separation, and a micro-contrastive term that constructs hard negatives sharing full scene semantics with the positive sample but differing along a single, LLM-perturbed axis of physical behavior, spanning kinematics, forces, materials, interactions, and magnitudes. A velocity-space distributional regularizer helps prevent catastrophic forgetting of pretrained visual quality. When applied to Wan 2.1-1.3B, our method improves the physical commonsense score on VideoPhy by 16.7% and 11.3% compared to the baseline and SFT, respectively, without increasing training time.
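The dual-scale objective described above can be illustrated with a minimal sketch. Note this is a hypothetical toy loss written for intuition, not the paper's actual formulation: the function name, weighting scheme, and plain squared-distance contrastive terms are all assumptions; the real method operates on velocity-field trajectories of a flow-matching model with partition-exclusive and LLM-perturbed negatives.

```python
import numpy as np

def direct_style_loss(v_pred, v_pos, v_neg_macro, v_neg_micro,
                      lam_macro=0.1, lam_micro=0.1):
    """Toy dual-scale contrastive flow-matching loss (illustrative only).

    v_pred      : model's predicted velocity for the true condition
    v_pos       : flow-matching target velocity (positive)
    v_neg_macro : target velocity under a semantically distant prompt
    v_neg_micro : target velocity for the same scene with one physical
                  axis perturbed (e.g. reversed gravity direction)
    """
    # Standard flow-matching reconstruction: pull toward the positive target.
    fm = np.mean((v_pred - v_pos) ** 2)
    # Macro term: push the prediction away from globally unrelated trajectories.
    macro = -np.mean((v_pred - v_neg_macro) ** 2)
    # Micro term: push away from the physically implausible hard negative.
    micro = -np.mean((v_pred - v_neg_micro) ** 2)
    return fm + lam_macro * macro + lam_micro * micro
```

Because the negatives enter with a repulsive sign, a prediction that matches the positive target while staying far from both negatives minimizes the loss; the separate macro/micro weights reflect the paper's point that semantically distant and physics-perturbed negatives play different roles.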
Problem

Research questions and friction points this paper is trying to address.

physics-refined video generation
semantic-physics entanglement
contrastive learning
flow matching
physical commonsense
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangled Regularization
Contrastive Trajectories
Physics-Refined Video Generation
Flow Matching
Hard Negative Sampling