SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing instruction-guided video editing methods struggle to simultaneously achieve precise semantic modification and faithful motion preservation, and their reliance on external priors limits generalization. This work proposes a factorized framework that decouples semantics from motion: it first performs structural planning by jointly predicting semantic tokens and video latents at sparse anchor frames, then internalizes temporal dynamics through self-supervised motion alignment pretraining on motion-centric restoration tasks (cube inpainting, speed perturbation, and tube shuffle). Remarkably, this factorized pretraining alone, which uses only unpaired data and no external priors, already yields strong zero-shot editing ability. After supervised fine-tuning on paired editing data, the method achieves state-of-the-art performance among open-source systems and is competitive with commercial solutions such as Kling-Omni.
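
To make the factorization concrete, below is a minimal sketch of what the stage-1 semantic anchoring objective could look like: a shared backbone that, given an instruction embedding and features of a few sparse anchor frames, jointly predicts discrete semantic tokens and continuous video latents. All module names, tensor shapes, and the concatenation-based conditioning are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class AnchorPlanner(nn.Module):
    """Hypothetical stage-1 head: joint semantic-token / video-latent
    prediction at sparse anchor frames (all sizes are assumptions)."""
    def __init__(self, dim: int = 256, vocab: int = 8192, latent_dim: int = 16):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.sem_head = nn.Linear(dim, vocab)       # logits over semantic tokens
        self.lat_head = nn.Linear(dim, latent_dim)  # regressed video latents

    def forward(self, instr_emb: torch.Tensor, anchor_emb: torch.Tensor):
        # Condition anchor-frame features on the instruction by simple
        # sequence concatenation (an assumption, not the paper's design).
        h = self.backbone(torch.cat([instr_emb, anchor_emb], dim=1))
        h = h[:, instr_emb.shape[1]:]               # keep anchor positions only
        return self.sem_head(h), self.lat_head(h)

planner = AnchorPlanner()
instr = torch.randn(1, 12, 256)    # 12 instruction tokens
anchors = torch.randn(1, 3, 256)   # 3 sparse anchor-frame features
sem_logits, latents = planner(instr, anchors)
print(sem_logits.shape, latents.shape)  # (1, 3, 8192) (1, 3, 16)
```

Training such a head with a cross-entropy loss on the semantic tokens plus a regression loss on the latents would give the "instruction-aware structural planning" signal the summary describes, with motion handled separately by the alignment stage.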

📝 Abstract
Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), enabling the model to internalize temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.
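
The three motion-centric pretext tasks are easy to picture as corruption transforms applied to a raw clip; a model trained to undo them must internalize temporal dynamics. The sketch below assumes a (T, C, H, W) tensor layout and illustrative cuboid/tube sizes; none of these hyperparameters come from the paper.

```python
import torch

def cube_inpainting(video: torch.Tensor, t: int = 4, h: int = 32, w: int = 32):
    """Zero out a random spatio-temporal cuboid; the model must restore it."""
    T, C, H, W = video.shape
    t0 = torch.randint(0, T - t + 1, (1,)).item()
    y0 = torch.randint(0, H - h + 1, (1,)).item()
    x0 = torch.randint(0, W - w + 1, (1,)).item()
    corrupted = video.clone()
    corrupted[t0:t0 + t, :, y0:y0 + h, x0:x0 + w] = 0.0
    return corrupted

def speed_perturbation(video: torch.Tensor, factor: float = 2.0):
    """Resample the temporal axis so the clip plays at a different speed."""
    T = video.shape[0]
    idx = torch.linspace(0, T - 1, steps=int(T / factor)).round().long()
    return video[idx]

def tube_shuffle(video: torch.Tensor, tube: int = 32):
    """Shuffle frame order inside one spatial tube, breaking local dynamics."""
    T, C, H, W = video.shape
    y0 = torch.randint(0, H - tube + 1, (1,)).item()
    x0 = torch.randint(0, W - tube + 1, (1,)).item()
    perm = torch.randperm(T)
    corrupted = video.clone()
    corrupted[:, :, y0:y0 + tube, x0:x0 + tube] = \
        video[perm][:, :, y0:y0 + tube, x0:x0 + tube]
    return corrupted

clip = torch.rand(16, 3, 128, 128)  # dummy 16-frame clip
for task in (cube_inpainting, speed_perturbation, tube_shuffle):
    print(task.__name__, tuple(task(clip).shape))
```

Because all three corruptions are computed from the raw clip itself, the restoration objective needs no instructions or paired edits, which is what lets the factorized pre-training stage run on unpaired video alone.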
Problem

Research questions and friction points this paper is trying to address.

instruction-guided video editing
semantic modification
motion preservation
video editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Anchoring
Motion Alignment
Factorized Representation
Zero-shot Video Editing
Instruction-guided Editing