Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation

📅 2025-12-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper targets distorted motion structure and physically implausible dynamics in generated videos of articulated and deformable objects such as humans and animals. It distills structure-preserving motion priors from an autoregressive tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). The method combines a bidirectional feature fusion module, which extracts global motion priors from SAM2, with a Local Gram Flow loss that explicitly constrains how local features deform together; training additionally uses a geometry-aware motion alignment loss. In experiments, the VBench score reaches 95.51% (2.60 percentage points above REPA's 92.91%), FVD drops to 360.57 (21.20% and 22.46% lower than REPA and LoRA fine-tuning, respectively), and human preference reaches 71.4%.

📝 Abstract

Reality is a dance between rigid constraints and deformable structures. For video models, that means generating motion that preserves fidelity as well as structure. Despite progress in diffusion models, producing realistic structure-preserving motion remains challenging, especially for articulated and deformable objects such as humans and animals. Scaling training data alone, so far, has failed to resolve physically implausible transitions. Existing approaches rely on conditioning with noisy motion representations, such as optical flow or skeletons extracted using an external imperfect model. To address these challenges, we introduce an algorithm to distill structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). With our method, we train SAM2VideoX, which contains two innovations: (1) a bidirectional feature fusion module that extracts global structure-preserving motion priors from a recurrent model like SAM2; (2) a Local Gram Flow loss that aligns how local features move together. Experiments on VBench and in human studies show that SAM2VideoX delivers consistent gains (+2.60% on VBench, 21-22% lower FVD, and 71.4% human preference) over prior baselines. Specifically, on VBench, we achieve 95.51%, surpassing REPA (92.91%) by 2.60%, and reduce FVD to 360.57, a 21.20% and 22.46% improvement over REPA- and LoRA-finetuning, respectively. The project website can be found at https://sam2videox.github.io/ .
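The reported numbers mix two kinds of change: the VBench gain is an absolute difference in percentage points, while the FVD improvements are relative reductions. The short check below works through that arithmetic using only the figures quoted in the abstract. The baseline FVD values it prints are back-calculated from the stated percentages, so they are implied estimates rather than numbers reported here.

```python
# Figures quoted in the abstract.
vbench_ours = 95.51
vbench_repa = 92.91
fvd_ours = 360.57
fvd_reduction_vs_repa = 0.2120   # 21.20% lower than REPA fine-tuning
fvd_reduction_vs_lora = 0.2246   # 22.46% lower than LoRA fine-tuning

# VBench: the "+2.60%" is an absolute gap in percentage points.
vbench_gain_points = vbench_ours - vbench_repa
# The same gap expressed relative to the REPA score is slightly larger.
vbench_gain_relative = vbench_gain_points / vbench_repa

# FVD (lower is better): a relative reduction r means ours = baseline * (1 - r),
# so the implied baseline is ours / (1 - r). These are estimates, not reported values.
implied_fvd_repa = fvd_ours / (1.0 - fvd_reduction_vs_repa)
implied_fvd_lora = fvd_ours / (1.0 - fvd_reduction_vs_lora)

print(f"VBench gain: {vbench_gain_points:.2f} points "
      f"({100 * vbench_gain_relative:.2f}% relative)")
print(f"Implied REPA FVD: {implied_fvd_repa:.2f}")
print(f"Implied LoRA FVD: {implied_fvd_lora:.2f}")
```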

Problem

Research questions and friction points this paper is trying to address.

Generating realistic video motion that preserves object structure.

Avoiding physically implausible transitions in articulated and deformable objects.

Replacing noisy motion conditioning (optical flow or skeletons from an imperfect external model) with stronger motion priors.

Innovation

Methods, ideas, or system contributions that make the work stand out.

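Distills structure-preserving motion priors from an autoregressive tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX).

Uses a bidirectional feature fusion module to extract global structure-preserving motion priors.

Introduces a Local Gram Flow loss that aligns how local features move together.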
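The exact form of the Local Gram Flow loss is not given on this page, so the following is only a rough sketch of one plausible reading of "aligning how local features move together". It assumes per-frame feature maps of shape (T, H, W, C), compares each location with a small neighbourhood in the next frame, and penalises the squared difference between the student's and the teacher's similarity patterns. The function names, the window radius, and the use of cosine similarity with a mean-squared error are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def local_gram_flow(feats, radius=1):
    """Hypothetical local cross-frame similarity map.

    feats: array of shape (T, H, W, C).
    Returns an array of shape (T-1, H, W, K) with K = (2*radius+1)**2, where
    each entry is the cosine similarity between a location in frame t and one
    neighbour of that location in frame t+1.
    """
    # L2-normalise channels so dot products are cosine similarities.
    norm = np.linalg.norm(feats, axis=-1, keepdims=True)
    f = feats / np.maximum(norm, 1e-8)
    _, H, W, _ = f.shape
    cur = f[:-1]
    # Zero-pad the next frames spatially so border neighbours are defined.
    nxt = np.pad(f[1:], ((0, 0), (radius, radius), (radius, radius), (0, 0)))
    sims = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            shifted = nxt[:, dy:dy + H, dx:dx + W, :]
            sims.append(np.sum(cur * shifted, axis=-1))
    return np.stack(sims, axis=-1)

def local_gram_flow_loss(student, teacher, radius=1):
    """Mean squared difference between student and teacher similarity maps."""
    diff = local_gram_flow(student, radius) - local_gram_flow(teacher, radius)
    return float(np.mean(diff ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(4, 6, 6, 8))    # stand-in for tracker features
    student = rng.normal(size=(4, 6, 6, 16))   # stand-in for diffusion features
    print(local_gram_flow(teacher).shape)
    print(local_gram_flow_loss(student, teacher))
```

In this sketch the two feature sets are compared only through their similarity maps, so the student and teacher may have different channel widths as long as their temporal and spatial sizes match.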

🔎 Similar Papers

Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion