STANCE: Motion Coherent Video Generation Via Sparse-to-Dense Anchored Encoding

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video generation still faces two key bottlenecks: semantic collapse of sparse motion cues (e.g., 2D sketches) after encoding, and temporal instability arising from entangled appearance-motion modeling that favors texture over coherent dynamics. To address these, we propose an instance-guided dense 2.5D motion field modeling framework. First, we synthesize a dense motion field from sparse inputs augmented with instance-level cues, enhanced by monocular depth and mask guidance to improve geometric plausibility. Second, we introduce Dense RoPE—a spatially addressable rotary positional encoding in Transformers—that explicitly anchors keyframe motion features and decouples structural evolution from appearance rendering. Our method requires no per-frame trajectory annotations, significantly strengthening the fidelity and effectiveness of motion guidance. Experiments demonstrate superior long-range dynamic structure consistency and higher visual fidelity in image-to-video synthesis.

📝 Abstract
Video generation has recently made striking visual progress, but maintaining coherent object motion and interactions remains difficult. We trace two practical bottlenecks: (i) human-provided motion hints (e.g., small 2D maps) often collapse to too few effective tokens after encoding, weakening guidance; and (ii) optimizing for appearance and motion in a single head can favor texture over temporal consistency. We present STANCE, an image-to-video framework that addresses both issues with two simple components. First, we introduce Instance Cues -- a pixel-aligned control signal that turns sparse, user-editable hints into a dense 2.5D (camera-relative) motion field by averaging per-instance flow and augmenting it with monocular depth over the instance mask. This reduces depth ambiguity compared to 2D arrow inputs while remaining easy to use. Second, we preserve the salience of these cues in token space with Dense RoPE, which tags a small set of motion tokens (anchored on the first frame) with spatially addressable rotary embeddings. Paired with joint RGB + auxiliary-map prediction (segmentation or depth), the auxiliary map anchors structure while the RGB branch handles appearance, stabilizing optimization and improving temporal coherence without requiring per-frame trajectory scripts.
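The Instance Cues construction described above can be sketched in a few lines: average the 2D flow over each instance mask, broadcast that mean back over the mask, and attach monocular depth as a third channel to form the 2.5D field. This is a minimal NumPy sketch under stated assumptions; the function name `dense_motion_field` and the exact channel layout are hypothetical, not from the paper.

```python
import numpy as np

def dense_motion_field(flow, depth, instance_masks):
    """Hedged sketch of an Instance-Cues-style dense 2.5D motion field.

    flow:           (H, W, 2) sparse/partial 2D flow hints
    depth:          (H, W)    monocular depth estimate
    instance_masks: list of (H, W) boolean masks, one per instance
    Returns (H, W, 3): per-instance mean 2D flow broadcast over each
    mask, with camera-relative depth as the third channel.
    """
    H, W, _ = flow.shape
    field = np.zeros((H, W, 3), dtype=np.float32)
    for mask in instance_masks:
        if not mask.any():
            continue
        mean_flow = flow[mask].mean(axis=0)  # average flow over the instance
        field[mask, 0] = mean_flow[0]
        field[mask, 1] = mean_flow[1]
    field[..., 2] = depth                    # depth channel makes it 2.5D
    return field
```

Averaging per instance is what makes a handful of user-drawn arrows act like a rigid-motion prior for the whole object, rather than a few isolated pixels that vanish after tokenization.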
Problem

Research questions and friction points this paper is trying to address.

Enhancing motion coherence in video generation
Addressing sparse motion guidance collapse in encoding
Separating appearance and motion optimization for consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dense 2.5D motion field from sparse user hints
Spatially addressable rotary embeddings for motion tokens
Joint RGB and auxiliary map prediction for coherence
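To illustrate the second innovation, here is a minimal NumPy sketch of tagging a small set of motion tokens with spatially addressable rotary embeddings: each token's channels are rotated by its (row, col) anchor coordinates, standard-RoPE style. The names `rope_rotate` and `dense_rope_2d`, the row/column channel split, and the frequency base are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply 1D rotary position embedding to token features.

    x:         (N, D) token features, D even
    positions: (N,)   integer positions (e.g., pixel coordinates)
    """
    N, D = x.shape
    half = D // 2
    freqs = base ** (-np.arange(half) / half)     # per-pair frequencies
    angles = positions[:, None] * freqs[None, :]  # (N, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1_i, x2_i) pair by its angle; norms are preserved.
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def dense_rope_2d(tokens, rows, cols):
    """Hypothetical Dense-RoPE-style tagging: first half of the channels
    is rotated by the token's row index, second half by its column index,
    so each motion token carries an explicit spatial address."""
    N, D = tokens.shape
    half = D // 2
    out = np.empty_like(tokens)
    out[:, :half] = rope_rotate(tokens[:, :half], rows)
    out[:, half:] = rope_rotate(tokens[:, half:], cols)
    return out
```

Because the rotation is norm-preserving and position-dependent, attention can distinguish motion tokens by where they anchor on the first frame, which is what keeps a small set of cues from collapsing into interchangeable tokens.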