STANCE: Motion Coherent Video Generation Via Sparse-to-Dense Anchored Encoding

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video generation still faces two key bottlenecks: semantic collapse of sparse motion cues (e.g., 2D sketches) after encoding, and temporal instability arising from entangled appearance-motion modeling that favors texture over coherent dynamics. To address these, we propose an instance-guided dense 2.5D motion field modeling framework. First, we synthesize a dense motion field from sparse inputs augmented with instance-level cues, enhanced by monocular depth and mask guidance to improve geometric plausibility. Second, we introduce Dense RoPE—a spatially addressable rotary positional encoding in Transformers—that explicitly anchors keyframe motion features and decouples structural evolution from appearance rendering. Our method requires no per-frame trajectory annotations, significantly strengthening the fidelity and effectiveness of motion guidance. Experiments demonstrate superior long-range dynamic structure consistency and higher visual fidelity in image-to-video synthesis.

📝 Abstract
Video generation has recently made striking visual progress, but maintaining coherent object motion and interactions remains difficult. We trace two practical bottlenecks: (i) human-provided motion hints (e.g., small 2D maps) often collapse to too few effective tokens after encoding, weakening guidance; and (ii) optimizing for appearance and motion in a single head can favor texture over temporal consistency. We present STANCE, an image-to-video framework that addresses both issues with two simple components. First, we introduce Instance Cues -- a pixel-aligned control signal that turns sparse, user-editable hints into a dense 2.5D (camera-relative) motion field by averaging per-instance flow and augmenting it with monocular depth over the instance mask. This reduces depth ambiguity compared to 2D arrow inputs while remaining easy to use. Second, we preserve the salience of these cues in token space with Dense RoPE, which tags a small set of motion tokens (anchored on the first frame) with spatially addressable rotary embeddings. Paired with joint RGB + auxiliary-map prediction (segmentation or depth), the auxiliary map anchors structure while the RGB branch handles appearance, stabilizing optimization and improving temporal coherence without requiring per-frame trajectory scripts.
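The Instance Cues construction described above can be sketched in a few lines: average the 2D flow over each instance mask, broadcast that mean back over the mask, and attach monocular depth as a third channel to form the 2.5D field. This is a minimal NumPy sketch under stated assumptions; the function name `dense_motion_field` and the exact channel layout are hypothetical, not from the paper.

```python
import numpy as np

def dense_motion_field(flow, depth, instance_masks):
    """Hedged sketch of an Instance-Cues-style dense 2.5D motion field.

    flow:           (H, W, 2) sparse/partial 2D flow hints
    depth:          (H, W)    monocular depth estimate
    instance_masks: list of (H, W) boolean masks, one per instance
    Returns (H, W, 3): per-instance mean 2D flow broadcast over each
    mask, with camera-relative depth as the third channel.
    """
    H, W, _ = flow.shape
    field = np.zeros((H, W, 3), dtype=np.float32)
    for mask in instance_masks:
        if not mask.any():
            continue
        mean_flow = flow[mask].mean(axis=0)  # average flow over the instance
        field[mask, 0] = mean_flow[0]
        field[mask, 1] = mean_flow[1]
    field[..., 2] = depth                    # depth channel makes it 2.5D
    return field
```

Averaging per instance is what makes a handful of user-drawn arrows act like a rigid-motion prior for the whole object, rather than a few isolated pixels that vanish after tokenization.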
Problem

Research questions and friction points this paper is trying to address.

Enhancing motion coherence in video generation
Addressing sparse motion guidance collapse in encoding
Separating appearance and motion optimization for consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dense 2.5D motion field from sparse user hints
Spatially addressable rotary embeddings for motion tokens
Joint RGB and auxiliary map prediction for coherence
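To illustrate the second innovation, here is a minimal NumPy sketch of tagging a small set of motion tokens with spatially addressable rotary embeddings: each token's channels are rotated by its (row, col) anchor coordinates, standard-RoPE style. The names `rope_rotate` and `dense_rope_2d`, the row/column channel split, and the frequency base are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply 1D rotary position embedding to token features.

    x:         (N, D) token features, D even
    positions: (N,)   integer positions (e.g., pixel coordinates)
    """
    N, D = x.shape
    half = D // 2
    freqs = base ** (-np.arange(half) / half)     # per-pair frequencies
    angles = positions[:, None] * freqs[None, :]  # (N, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1_i, x2_i) pair by its angle; norms are preserved.
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def dense_rope_2d(tokens, rows, cols):
    """Hypothetical Dense-RoPE-style tagging: first half of the channels
    is rotated by the token's row index, second half by its column index,
    so each motion token carries an explicit spatial address."""
    N, D = tokens.shape
    half = D // 2
    out = np.empty_like(tokens)
    out[:, :half] = rope_rotate(tokens[:, :half], rows)
    out[:, half:] = rope_rotate(tokens[:, half:], cols)
    return out
```

Because the rotation is norm-preserving and position-dependent, attention can distinguish motion tokens by where they anchor on the first frame, which is what keeps a small set of cues from collapsing into interchangeable tokens.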