Point-to-Point: Sparse Motion Guidance for Controllable Video Editing

📅 2025-11-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
In video editing, preserving the motion trajectory of the edited subject while maintaining editing fidelity remains challenging: existing methods either over-rely on layout constraints or model motion only implicitly, leading to motion distortions. This paper introduces *anchor tokens*, a compact, semantically aligned, point-based motion representation that explicitly models dynamic content via sparse keypoint trajectories and supports flexible repositioning. The authors present the first integration of anchor tokens into video diffusion models, leveraging self-attention to extract keypoint trajectories from raw video without supervision and employing differentiable point transformations for motion-guided editing and structural alignment with target poses. Without requiring manual annotations, the approach significantly improves motion coherence and semantic consistency, achieving state-of-the-art performance in both editing accuracy and motion fidelity across multiple benchmarks.
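
The keypoint-extraction step described above lends itself to a short illustration. The sketch below shows one plausible way to pull sparse anchor trajectories out of per-frame self-attention maps: keep the most-attended spatial tokens as anchors and follow them across frames via attention-map affinity. The function name, tensor shapes, and the tracking heuristic are illustrative assumptions, not the paper's actual procedure.

```python
import torch

def extract_anchor_trajectories(attn_maps: torch.Tensor, num_anchors: int = 8) -> torch.Tensor:
    """Pick sparse anchor points and follow them across frames.

    attn_maps: (T, N, N) per-frame spatial self-attention, N = H*W latent tokens
               (assumed shape). Returns trajectories of shape (T, num_anchors, 2)
               in (x, y) latent-grid coordinates.
    """
    T, N, _ = attn_maps.shape
    side = int(N ** 0.5)  # assume a square H*W latent grid

    # Score tokens by how much attention they receive in the first frame and
    # keep the top-k as anchors (a simple saliency heuristic, not the paper's).
    received = attn_maps[0].mean(dim=0)                      # (N,)
    anchor_idx = received.topk(num_anchors).indices          # (num_anchors,)

    trajectories = []
    for t in range(T):
        # Affinity between frame-0 tokens and frame-t tokens, measured as the
        # similarity of their attention distributions (illustrative heuristic).
        affinity = attn_maps[0] @ attn_maps[t].transpose(0, 1)   # (N, N)
        matched = affinity[anchor_idx].argmax(dim=-1)            # (num_anchors,)
        xs, ys = matched % side, matched // side
        trajectories.append(torch.stack([xs, ys], dim=-1).float())
    return torch.stack(trajectories, dim=0)                      # (T, num_anchors, 2)

# Example usage with random attention over a 16x16 latent grid (8 frames):
trajs = extract_anchor_trajectories(torch.rand(8, 256, 256).softmax(dim=-1))
print(trajs.shape)  # torch.Size([8, 8, 2])
```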

📝 Abstract
Accurately preserving motion while editing a subject remains a core challenge in video editing tasks. Existing methods often face a trade-off between edit and motion fidelity, as they rely on motion representations that are either overfitted to the layout or only implicitly defined. To overcome this limitation, we revisit point-based motion representation. However, identifying meaningful points remains challenging without human input, especially across diverse video scenarios. To address this, we propose a novel motion representation, anchor tokens, that capture the most essential motion patterns by leveraging the rich prior of a video diffusion model. Anchor tokens encode video dynamics compactly through a small number of informative point trajectories and can be flexibly relocated to align with new subjects. This allows our method, Point-to-Point, to generalize across diverse scenarios. Extensive experiments demonstrate that anchor tokens lead to more controllable and semantically aligned video edits, achieving superior performance in terms of edit and motion fidelity.
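
As a companion to the abstract's claim that anchor tokens "can be flexibly relocated to align with new subjects", the sketch below shows a minimal differentiable point transformation: a Procrustes-style similarity fit that transplants source trajectories onto a target subject's anchor points. The function, shapes, and closed-form SVD fit are assumptions chosen for illustration; the paper's actual transformation may differ.

```python
import torch

def relocate_trajectories(src_trajs: torch.Tensor, tgt_anchors: torch.Tensor) -> torch.Tensor:
    """Replay source motion at a target subject's location and scale.

    src_trajs:   (T, K, 2) anchor trajectories extracted from the source video.
    tgt_anchors: (K, 2) corresponding anchor points on the target subject at the
                 reference frame (assumed correspondence).
    Returns relocated trajectories of shape (T, K, 2).
    """
    src0 = src_trajs[0]                                     # (K, 2) reference-frame anchors
    src_c, tgt_c = src0.mean(dim=0), tgt_anchors.mean(dim=0)
    src_n, tgt_n = src0 - src_c, tgt_anchors - tgt_c

    # Closed-form similarity (rotation + uniform scale) via SVD of the
    # cross-covariance; the sign correction guards against reflections.
    H = src_n.T @ tgt_n                                     # (2, 2)
    U, S, Vt = torch.linalg.svd(H)
    d = torch.sign(torch.linalg.det(Vt.T @ U.T))
    signs = torch.stack([torch.ones_like(d), d])
    R = Vt.T @ torch.diag(signs) @ U.T                      # (2, 2) rotation
    s = (S * signs).sum() / (src_n ** 2).sum()              # uniform scale

    # Apply the same transform to every frame, transplanting the motion
    # pattern onto the target's position and scale.
    return s * (src_trajs - src_c) @ R.T + tgt_c

# Example usage: move an 8-frame, 4-point trajectory onto shifted, scaled target anchors.
src = torch.rand(8, 4, 2)
relocated = relocate_trajectories(src, src[0] * 0.5 + 10.0)
print(relocated.shape)  # torch.Size([8, 4, 2])
```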
Problem

Research questions and friction points this paper is trying to address.

Accurately preserving motion during video editing remains challenging
Existing methods struggle with edit-motion fidelity trade-offs
Identifying meaningful motion points without human input is difficult
Innovation

Methods, ideas, or system contributions that make the work stand out.

Anchor tokens encode essential motion patterns compactly
Point trajectories enable flexible relocation for new subjects
Method generalizes across diverse video editing scenarios