🤖 AI Summary
In video editing, preserving the motion trajectory of the edited subject while maintaining editing fidelity remains challenging: existing methods either over-rely on layout constraints or model motion only implicitly, leading to motion distortions. This paper introduces *anchor tokens*, a compact, semantically aligned, point-based motion representation that explicitly models dynamic content via sparse keypoint trajectories and supports flexible repositioning. We are the first to integrate anchor tokens into video diffusion models, using self-attention to extract keypoint trajectories from raw video without supervision, and employing differentiable point transformations for motion-guided editing and structural alignment with target poses. Without requiring manual annotations, our approach significantly improves motion coherence and semantic consistency, achieving state-of-the-art editing accuracy and motion fidelity across multiple benchmarks.
📝 Abstract
Accurately preserving motion while editing a subject remains a core challenge in video editing. Existing methods often face a trade-off between edit and motion fidelity, as they rely on motion representations that are either overfitted to the layout or only implicitly defined. To overcome this limitation, we revisit point-based motion representation. However, identifying meaningful points remains challenging without human input, especially across diverse video scenarios. To address this, we propose *anchor tokens*, a novel motion representation that captures the most essential motion patterns by leveraging the rich prior of a video diffusion model. Anchor tokens encode video dynamics compactly through a small number of informative point trajectories and can be flexibly relocated to align with new subjects. This allows our method, Point-to-Point, to generalize across diverse scenarios. Extensive experiments demonstrate that anchor tokens lead to more controllable and semantically aligned video edits, achieving superior performance in both edit and motion fidelity.
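To make the core idea concrete, here is a rough, self-contained sketch of the two ingredients the summary describes: selecting a few highly attended point trajectories as "anchor tokens," and relocating them with a simple differentiable-style affine transform. This is not the paper's implementation; the tensor shapes, the attention-based saliency heuristic, and the function names are all assumptions for illustration.

```python
import numpy as np


def select_anchor_trajectories(attn, trajectories, k=4):
    """Pick the k most-attended point trajectories as 'anchor tokens'.

    attn:         (T, N, N) per-frame self-attention maps (assumed shape)
    trajectories: (N, T, 2) candidate point tracks in (x, y) coordinates
    """
    # One plausible saliency heuristic: a token's importance is how much
    # attention it receives, averaged over all queries and frames.
    importance = attn.mean(axis=(0, 1))           # (N,)
    anchor_ids = np.argsort(importance)[-k:]      # indices of top-k tokens
    return anchor_ids, trajectories[anchor_ids]   # (k,), (k, T, 2)


def relocate_anchors(anchors, scale=1.0, offset=(0.0, 0.0)):
    """Affine relocation of anchor trajectories, e.g. to align the
    preserved motion with a new subject's position and size."""
    return anchors * scale + np.asarray(offset, dtype=float)


# Toy usage: 8 frames, 16 candidate point tracks.
T, N = 8, 16
rng = np.random.default_rng(0)
attn = rng.random((T, N, N))
traj = rng.random((N, T, 2))

ids, anchors = select_anchor_trajectories(attn, traj, k=4)
moved = relocate_anchors(anchors, scale=0.5, offset=(10.0, 5.0))
```

In the actual method these operations would act on diffusion-model attention maps and be differentiable end to end (e.g. in PyTorch); the NumPy version above only illustrates the data flow from attention saliency to a compact, relocatable set of point trajectories.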