🤖 AI Summary
Existing first-frame propagation methods struggle to achieve efficient and controllable general video editing due to limitations imposed by small-scale, low-resolution training data and reliance on runtime guidance. This work proposes a novel guidance-free paradigm for first-frame propagation, introducing FFP-300K—a large-scale dataset comprising 300,000 pairs of high-fidelity 720p videos, each 81 frames long. To effectively disentangle appearance and motion information, the approach integrates self-distillation training, identity propagation regularization, and an adaptive spatiotemporal RoPE (AST-RoPE) positional encoding. The proposed method substantially enhances editing generalization and temporal consistency, achieving improvements of approximately 0.2 in PickScore and 0.3 in VLM score on EditVerseBench, outperforming both current academic and commercial models.
📝 Abstract
First-Frame Propagation (FFP) offers a promising paradigm for controllable video editing, but existing methods are hampered by a reliance on cumbersome run-time guidance. We identify the root cause of this limitation as the inadequacy of current training datasets, which are often too short and too low-resolution, and which lack the task diversity required to teach robust temporal priors. To address this foundational data gap, we first introduce FFP-300K, a new large-scale dataset comprising 300K high-fidelity video pairs at 720p resolution and 81 frames in length, constructed via a principled two-track pipeline covering diverse local and global edits. Building on this dataset, we propose a novel framework designed for true guidance-free FFP that resolves the critical tension between maintaining first-frame appearance and preserving source video motion. Architecturally, we introduce Adaptive Spatio-Temporal RoPE (AST-RoPE), which dynamically remaps positional encodings to disentangle appearance and motion references. At the objective level, we employ a self-distillation strategy in which an identity propagation task acts as a powerful regularizer, ensuring long-term temporal stability and preventing semantic drift. Comprehensive experiments on the EditVerseBench benchmark demonstrate that our method significantly outperforms existing academic and commercial models, with improvements of approximately 0.2 in PickScore and 0.3 in VLM score over these competitors.
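The abstract does not spell out how AST-RoPE remaps positions, so the following is only an illustrative sketch: a standard rotary positional embedding (RoPE) over a 1-D temporal index, plus a hypothetical `remap_positions` helper showing the general idea of assigning a reference frame a shared position so it aligns with the source timeline. The function names and the remapping rule are assumptions, not the paper's actual design.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary positional embedding along the last dim of x.

    x: (seq, dim) array with even dim; positions: (seq,) positions.
    Each feature pair (x1_i, x2_i) is rotated by position * freq_i,
    so dot products between tokens depend on relative position.
    """
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)      # (half,) rotation frequencies
    angles = positions[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2-D rotation applied pairwise; preserves each token's norm.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def remap_positions(num_frames, ref_position=0):
    """Hypothetical remap: give the appearance-reference frame the same
    temporal position as the source frame it replaces (ref_position),
    while the remaining frames keep their original timeline positions."""
    pos = np.arange(num_frames)
    pos[0] = ref_position
    return pos
```

Because RoPE is a pure rotation, it leaves token norms unchanged while making attention scores position-relative, which is why remapping indices (rather than retraining embeddings) is a plausible way to re-anchor an appearance reference onto the motion timeline.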